
v0.4.0 Release Notes

Release Date: 2026-05-23

Overview

v0.4.0 replaces numeric scores with categorical scoring: each category is now rated "pass", "partial", or "fail". This is a breaking change, intended to better match how LLM judges assess quality.

Breaking Changes

CategoryScore → CategoryResult

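The CategoryScore type has been renamed to CategoryResult and simplified. The Weight, MaxScore, and Status fields are removed, Score changes from float64 to ScoreValue, and Justification is renamed to Reasoning (JSON key "reasoning"):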

Before (v0.3.x):

type CategoryScore struct {
    Category      string      `json:"category"`
    Weight        float64     `json:"weight"`
    Score         float64     `json:"score"`
    MaxScore      float64     `json:"max_score"`
    Status        ScoreStatus `json:"status"`
    Justification string      `json:"justification"`
}

After (v0.4.0):

type CategoryResult struct {
    Category  string     `json:"category"`
    Score     ScoreValue `json:"score"`     // "pass", "partial", "fail"
    Reasoning string     `json:"reasoning"`
}

ScoreStatus → ScoreValue

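The ScoreStatus type is replaced by ScoreValue. The "warn" constant is gone, "partial" is new, and the constant names drop the "Status" infix:

Before: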

const (
    ScoreStatusPass ScoreStatus = "pass"
    ScoreStatusWarn ScoreStatus = "warn"
    ScoreStatusFail ScoreStatus = "fail"
)

After:

const (
    ScorePass    ScoreValue = "pass"
    ScorePartial ScoreValue = "partial"
    ScoreFail    ScoreValue = "fail"
)
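
For code that still holds v0.3.x values, a one-off conversion can be sketched as below. This is a minimal illustration, not part of the library: the types are local copies of the definitions above, convertStatus and convertCategory are hypothetical helper names, and mapping "warn" to "partial" is an assumption that you should confirm fits your data.

```go
package main

import "fmt"

// Local copies of the v0.3.x definitions shown above.
type ScoreStatus string

const (
	ScoreStatusPass ScoreStatus = "pass"
	ScoreStatusWarn ScoreStatus = "warn"
	ScoreStatusFail ScoreStatus = "fail"
)

type CategoryScore struct {
	Category      string
	Weight        float64
	Score         float64
	MaxScore      float64
	Status        ScoreStatus
	Justification string
}

// Local copies of the v0.4.0 definitions shown above.
type ScoreValue string

const (
	ScorePass    ScoreValue = "pass"
	ScorePartial ScoreValue = "partial"
	ScoreFail    ScoreValue = "fail"
)

type CategoryResult struct {
	Category  string
	Score     ScoreValue
	Reasoning string
}

// convertStatus is a hypothetical helper. Treating "warn" as "partial"
// is an assumption, not something the release notes specify.
func convertStatus(s ScoreStatus) (ScoreValue, error) {
	switch s {
	case ScoreStatusPass:
		return ScorePass, nil
	case ScoreStatusWarn:
		return ScorePartial, nil
	case ScoreStatusFail:
		return ScoreFail, nil
	}
	return "", fmt.Errorf("unknown status %q", s)
}

// convertCategory drops the numeric fields and carries the
// justification text over as the reasoning.
func convertCategory(old CategoryScore) (CategoryResult, error) {
	score, err := convertStatus(old.Status)
	if err != nil {
		return CategoryResult{}, err
	}
	return CategoryResult{
		Category:  old.Category,
		Score:     score,
		Reasoning: old.Justification,
	}, nil
}

func main() {
	old := CategoryScore{
		Category:      "problem_definition",
		Weight:        0.20,
		Score:         8.5,
		MaxScore:      10,
		Status:        ScoreStatusPass,
		Justification: "Clear problem statement",
	}
	migrated, err := convertCategory(old)
	if err != nil {
		fmt.Println("conversion failed:", err)
		return
	}
	fmt.Printf("%s: %s (%s)\n", migrated.Category, migrated.Score, migrated.Reasoning)
}
```

The conversion keys off the old Status rather than the numeric Score, which avoids having to pick score thresholds.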

Removed WeightedScore

The WeightedScore field has been removed from EvaluationReport. Category counts are now used instead:

// Before
fmt.Printf("Score: %.1f/10\n", report.WeightedScore)

// After
counts := report.Decision.CategoryCounts
fmt.Printf("Results: %d pass, %d partial, %d fail\n",
    counts.Pass, counts.Partial, counts.Fail)
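
To make the relationship between category results and the counts concrete, here is a minimal sketch of how such counts could be tallied. It assumes the three counters are ints (the snippet above prints them with %d); the local CategoryCounts type and the countCategories helper are hypothetical stand-ins, and the library may compute its counts differently.

```go
package main

import "fmt"

type ScoreValue string

const (
	ScorePass    ScoreValue = "pass"
	ScorePartial ScoreValue = "partial"
	ScoreFail    ScoreValue = "fail"
)

type CategoryResult struct {
	Category  string     `json:"category"`
	Score     ScoreValue `json:"score"`
	Reasoning string     `json:"reasoning"`
}

// CategoryCounts is a local stand-in with the three fields used above.
type CategoryCounts struct {
	Pass, Partial, Fail int
}

// countCategories tallies one counter per score value.
// Unrecognized values are ignored in this sketch.
func countCategories(results []CategoryResult) CategoryCounts {
	var c CategoryCounts
	for _, r := range results {
		switch r.Score {
		case ScorePass:
			c.Pass++
		case ScorePartial:
			c.Partial++
		case ScoreFail:
			c.Fail++
		}
	}
	return c
}

func main() {
	results := []CategoryResult{
		{Category: "problem_definition", Score: ScorePass},
		{Category: "scope", Score: ScorePartial},
		{Category: "risks", Score: ScoreFail},
	}
	counts := countCategories(results)
	fmt.Printf("Results: %d pass, %d partial, %d fail\n",
		counts.Pass, counts.Partial, counts.Fail)
}
```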

Migration Guide

Updating Category Creation

Before:

report.AddCategory(evaluation.NewCategoryScore(
    "problem_definition",
    0.20,  // weight
    8.5,   // score
    "Clear problem statement",
))

After:

report.AddCategory(evaluation.CategoryResult{
    Category:  "problem_definition",
    Score:     evaluation.ScorePass,
    Reasoning: "Clear problem statement with measurable goals",
})

Updating Decision Checks

Before:

if report.WeightedScore >= 7.0 {
    // Passed
}

After:

if report.Decision.Passed {
    // Passed
}
// Or check category counts
if report.Decision.CategoryCounts.Fail == 0 {
    // No failing categories
}
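
If a numeric threshold such as 7.0 encoded a stricter or looser policy than "no failing categories", a count-based gate is one way to express it. The sketch below is illustrative only: meetsGate and its maxPartial parameter are hypothetical, CategoryCounts is a local stand-in for the counts shown above, and the rule that sets report.Decision.Passed is not described in these notes.

```go
package main

import "fmt"

// CategoryCounts is a local stand-in for report.Decision.CategoryCounts.
type CategoryCounts struct {
	Pass, Partial, Fail int
}

// meetsGate is a hypothetical policy: no failing categories and
// at most maxPartial categories rated "partial".
func meetsGate(c CategoryCounts, maxPartial int) bool {
	return c.Fail == 0 && c.Partial <= maxPartial
}

func main() {
	counts := CategoryCounts{Pass: 4, Partial: 1, Fail: 0}
	fmt.Println("lenient gate:", meetsGate(counts, 1))
	fmt.Println("strict gate:", meetsGate(counts, 0))
}
```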

Updating Renderers

The render/detailed package has been updated for categorical results. If your custom rendering code printed numeric scores, switch it to the categorical values:

// Before
fmt.Printf("%.1f/%.0f", cs.Score, cs.MaxScore)

// After
fmt.Printf("%s", cr.Score) // "pass", "partial", or "fail"
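
A custom renderer usually needs more than the bare score value. The following sketch formats one result as a single line; formatResult and its output layout are hypothetical, and the types are local copies of the v0.4.0 definitions rather than imports from the library.

```go
package main

import (
	"fmt"
	"strings"
)

type ScoreValue string

const (
	ScorePass    ScoreValue = "pass"
	ScorePartial ScoreValue = "partial"
	ScoreFail    ScoreValue = "fail"
)

type CategoryResult struct {
	Category  string
	Score     ScoreValue
	Reasoning string
}

// formatResult renders a result as "[SCORE] category: reasoning".
func formatResult(cr CategoryResult) string {
	label := strings.ToUpper(string(cr.Score))
	return fmt.Sprintf("[%s] %s: %s", label, cr.Category, cr.Reasoning)
}

func main() {
	cr := CategoryResult{
		Category:  "problem_definition",
		Score:     ScorePass,
		Reasoning: "Clear problem statement with measurable goals",
	}
	fmt.Println(formatResult(cr))
}
```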

New Features

Terminal Renderer

A new terminal renderer prints ANSI-colored output with UTF-8 icons:

import (
    "os"

    "github.com/plexusone/structured-evaluation/render/terminal"
)

renderer := terminal.New(os.Stdout)
renderer.Render(&report)

Markdown Renderer

New Markdown renderer for documentation:

import (
    "os"

    "github.com/plexusone/structured-evaluation/render/markdown"
)

renderer := markdown.New(os.Stdout)
renderer.Render(&report)

CLI Formats

The CLI render command accepts two new output formats:

sevaluation render report.json --format=terminal
sevaluation render report.json --format=markdown

Why Categorical Scoring?

  1. Clearer semantics - "pass" is unambiguous; "7.5" requires interpretation
  2. Better LLM alignment - LLMs naturally reason in categories
  3. Simpler aggregation - Majority voting vs. weighted averages
  4. Reduced bias - No artificial precision (7.2 vs 7.3)
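
To illustrate the aggregation point in item 3, here is a minimal sketch of majority voting over several judges' verdicts for one category. The majorityVote helper is hypothetical and not part of the library, and resolving ties toward the more severe value is an assumption made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

type ScoreValue string

const (
	ScorePass    ScoreValue = "pass"
	ScorePartial ScoreValue = "partial"
	ScoreFail    ScoreValue = "fail"
)

// majorityVote returns the most frequent value among votes.
// Candidates are scanned from most to least severe with a strict
// comparison, so a tie resolves to the more severe value.
func majorityVote(votes []ScoreValue) (ScoreValue, error) {
	if len(votes) == 0 {
		return "", errors.New("no votes")
	}
	tally := map[ScoreValue]int{}
	for _, v := range votes {
		switch v {
		case ScorePass, ScorePartial, ScoreFail:
			tally[v]++
		default:
			return "", fmt.Errorf("unknown score value %q", v)
		}
	}
	var best ScoreValue
	bestCount := 0
	for _, candidate := range []ScoreValue{ScoreFail, ScorePartial, ScorePass} {
		if tally[candidate] > bestCount {
			best, bestCount = candidate, tally[candidate]
		}
	}
	return best, nil
}

func main() {
	votes := []ScoreValue{ScorePass, ScorePartial, ScorePass}
	verdict, err := majorityVote(votes)
	if err != nil {
		fmt.Println("aggregation failed:", err)
		return
	}
	fmt.Println("verdict:", verdict)
}
```

No weights or normalization are involved; the only policy decision is how to break ties.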

Full Changelog

See the CHANGELOG for complete details.