Categorical Scoring¶
As of v0.4.0, structured-evaluation uses categorical scoring instead of numeric scores. This aligns with how LLM judges naturally assess quality.
Score Values¶
const (
	ScorePass    ScoreValue = "pass"    // Meets requirements
	ScorePartial ScoreValue = "partial" // Partially meets requirements
	ScoreFail    ScoreValue = "fail"    // Does not meet requirements
)
Why Categorical?¶
Benefits over Numeric Scores¶
- Clearer semantics - "pass" is unambiguous; "7.5" requires interpretation
- Better LLM alignment - LLMs naturally reason in categories
- Simpler aggregation - Majority voting vs. weighted averages
- Reduced bias - No artificial precision (7.2 vs 7.3)
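The "simpler aggregation" point can be illustrated with a majority vote over several judges' scores. This is a minimal sketch, not the library's aggregation code: the local `ScoreValue` type only mirrors the constants documented above, and the `majority` helper and its tie-breaking rule (ties resolve to the more conservative score) are assumptions made for illustration.

```go
package main

import "fmt"

// ScoreValue mirrors the documented categorical score type for this sketch.
type ScoreValue string

const (
	ScorePass    ScoreValue = "pass"
	ScorePartial ScoreValue = "partial"
	ScoreFail    ScoreValue = "fail"
)

// majority returns the most frequent score among the votes.
// Assumption: ties resolve to the more conservative score (fail, then
// partial), and an empty slice yields ScoreFail.
func majority(votes []ScoreValue) ScoreValue {
	counts := map[ScoreValue]int{}
	for _, v := range votes {
		counts[v]++
	}
	// Walk from most to least conservative; a less conservative score
	// only wins when it has strictly more votes.
	best := ScoreFail
	for _, s := range []ScoreValue{ScorePartial, ScorePass} {
		if counts[s] > counts[best] {
			best = s
		}
	}
	return best
}

func main() {
	votes := []ScoreValue{ScorePass, ScorePartial, ScorePass}
	fmt.Println(majority(votes)) // prints "pass"
}
```

No weights or averaging are involved, so the result is always one of the three documented values.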
Comparison¶
| Numeric | Categorical | Interpretation |
|---|---|---|
| 8.0-10.0 | pass | Meets all requirements |
| 5.0-7.9 | partial | Meets most, minor issues |
| 0.0-4.9 | fail | Major gaps or issues |
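The ranges in the table can be expressed as a small mapping function. This is a sketch only: `fromNumeric` is a hypothetical helper, not a function the library is known to provide, and it assumes scores that fall between the listed ranges (for example 7.95) belong to the lower category.

```go
package main

import "fmt"

// ScoreValue mirrors the documented categorical score type for this sketch.
type ScoreValue string

const (
	ScorePass    ScoreValue = "pass"
	ScorePartial ScoreValue = "partial"
	ScoreFail    ScoreValue = "fail"
)

// fromNumeric maps a legacy 0-10 score onto the categories in the table.
// Hypothetical helper; thresholds are taken from the comparison table.
func fromNumeric(score float64) ScoreValue {
	switch {
	case score >= 8.0:
		return ScorePass
	case score >= 5.0:
		return ScorePartial
	default:
		return ScoreFail
	}
}

func main() {
	for _, s := range []float64{9.1, 7.9, 4.9} {
		fmt.Printf("%.1f -> %s\n", s, fromNumeric(s))
	}
}
```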
CategoryResult¶
Each evaluation category produces a result:
type CategoryResult struct {
	Category  string     `json:"category"`  // Category ID
	Score     ScoreValue `json:"score"`     // pass/partial/fail
	Reasoning string     `json:"reasoning"` // Explanation
}
Example¶
report.AddCategory(evaluation.CategoryResult{
	Category:  "problem_definition",
	Score:     evaluation.ScorePass,
	Reasoning: "Problem is clearly stated with measurable business impact",
})
report.AddCategory(evaluation.CategoryResult{
	Category:  "user_stories",
	Score:     evaluation.ScorePartial,
	Reasoning: "Stories present but 2 of 5 lack acceptance criteria",
})
report.AddCategory(evaluation.CategoryResult{
	Category:  "success_metrics",
	Score:     evaluation.ScoreFail,
	Reasoning: "No quantitative success metrics defined",
})
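Because `CategoryResult` carries `json` struct tags, a result serializes with the standard `encoding/json` package. The sketch below redeclares the documented types locally so it runs on its own; the `encode` helper is hypothetical and exists only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented types, so the sketch is self-contained.
type ScoreValue string

const ScorePartial ScoreValue = "partial"

type CategoryResult struct {
	Category  string     `json:"category"`
	Score     ScoreValue `json:"score"`
	Reasoning string     `json:"reasoning"`
}

// encode marshals a single result to its JSON text (hypothetical helper).
func encode(r CategoryResult) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	out, err := encode(CategoryResult{
		Category:  "user_stories",
		Score:     ScorePartial,
		Reasoning: "Stories present but 2 of 5 lack acceptance criteria",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	// {"category":"user_stories","score":"partial","reasoning":"Stories present but 2 of 5 lack acceptance criteria"}
}
```

The score appears as a plain string in the output, which keeps reports readable without a lookup table.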
CategoryCounts¶
The decision includes category counts for quick assessment:
type CategoryCounts struct {
	Pass    int `json:"pass"`
	Partial int `json:"partial"`
	Fail    int `json:"fail"`
	Total   int `json:"total"`
}
Usage¶
counts := report.Decision.CategoryCounts
fmt.Printf("Results: %d pass, %d partial, %d fail (of %d)\n",
	counts.Pass, counts.Partial, counts.Fail, counts.Total)
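To make the relationship between results and counts concrete, the following sketch tallies a slice of results into a `CategoryCounts` value. The types are local copies of the documented ones, and `tally` is a hypothetical helper; how the library itself populates `CategoryCounts` is not shown in this document.

```go
package main

import "fmt"

// Local copies of the documented types, so the sketch is self-contained.
type ScoreValue string

const (
	ScorePass    ScoreValue = "pass"
	ScorePartial ScoreValue = "partial"
	ScoreFail    ScoreValue = "fail"
)

type CategoryResult struct {
	Category  string
	Score     ScoreValue
	Reasoning string
}

type CategoryCounts struct {
	Pass, Partial, Fail, Total int
}

// tally counts results per score value (hypothetical helper).
// Total is the number of results, including any with unrecognized scores.
func tally(results []CategoryResult) CategoryCounts {
	counts := CategoryCounts{Total: len(results)}
	for _, r := range results {
		switch r.Score {
		case ScorePass:
			counts.Pass++
		case ScorePartial:
			counts.Partial++
		case ScoreFail:
			counts.Fail++
		}
	}
	return counts
}

func main() {
	c := tally([]CategoryResult{
		{Category: "problem_definition", Score: ScorePass},
		{Category: "user_stories", Score: ScorePartial},
		{Category: "success_metrics", Score: ScoreFail},
	})
	fmt.Printf("Results: %d pass, %d partial, %d fail (of %d)\n",
		c.Pass, c.Partial, c.Fail, c.Total)
	// Results: 1 pass, 1 partial, 1 fail (of 3)
}
```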
Score Methods¶
score := evaluation.ScorePass
score.IsPassing() // true
score.IsPartial() // false
score.IsFailing() // false
score.Icon() // "🟢"
| Score | Icon | IsPassing | IsPartial | IsFailing |
|---|---|---|---|---|
| pass | 🟢 | true | false | false |
| partial | 🟡 | false | true | false |
| fail | 🔴 | false | false | true |
Decision Logic¶
The overall decision is computed from category results:
// All pass → DecisionPass
// Any fail with blocking findings → DecisionFail
// Mix of pass/partial → DecisionConditional
// Uncertain → DecisionHumanReview
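The rules above leave some cases open, so the following is one possible reading rather than the library's actual logic. The decision names come from the comments above, but their string values, the `decide` function, and the `blocking` flag are assumptions. The sketch also assumes that all-partial results count as conditional, and that anything not covered by the first three rules (a fail without blocking findings, an unrecognized score, or no results) is routed to human review.

```go
package main

import "fmt"

type ScoreValue string

const (
	ScorePass    ScoreValue = "pass"
	ScorePartial ScoreValue = "partial"
	ScoreFail    ScoreValue = "fail"
)

// Decision names follow the document; the string values are assumed.
type Decision string

const (
	DecisionPass        Decision = "pass"
	DecisionFail        Decision = "fail"
	DecisionConditional Decision = "conditional"
	DecisionHumanReview Decision = "human_review"
)

// decide applies the four documented rules in order (hypothetical sketch).
// blocking reports whether any finding is marked as blocking.
func decide(scores []ScoreValue, blocking bool) Decision {
	var pass, partial, fail int
	for _, s := range scores {
		switch s {
		case ScorePass:
			pass++
		case ScorePartial:
			partial++
		case ScoreFail:
			fail++
		}
	}
	known := pass + partial + fail
	switch {
	case fail > 0 && blocking:
		return DecisionFail
	case known > 0 && known == len(scores) && pass == known:
		return DecisionPass
	case known > 0 && known == len(scores) && fail == 0:
		return DecisionConditional
	default:
		// Assumed meaning of "uncertain": nothing above applied.
		return DecisionHumanReview
	}
}

func main() {
	fmt.Println(decide([]ScoreValue{ScorePass, ScorePartial}, false)) // conditional
}
```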
Migration from Numeric¶
If migrating from v0.3.x or earlier:
| Old API | New API |
|---|---|
| CategoryScore | CategoryResult |
| Score float64 | Score ScoreValue |
| MaxScore float64 | (removed) |
| Status ScoreStatus | (merged into Score) |
| Justification string | Reasoning string |
| WeightedScore float64 | (removed from report) |
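A conversion for stored v0.3.x data might look like the sketch below. It is built on assumptions: `legacyCategoryScore` is reconstructed only from the field names in the table, `migrate` is a hypothetical helper, and the pass and partial cutoffs (80% and 50% of `MaxScore`) are borrowed from the comparison table rather than from any documented migration rule. A non-positive `MaxScore` is treated as a fail.

```go
package main

import "fmt"

type ScoreValue string

const (
	ScorePass    ScoreValue = "pass"
	ScorePartial ScoreValue = "partial"
	ScoreFail    ScoreValue = "fail"
)

// legacyCategoryScore approximates the old type from the migration table.
type legacyCategoryScore struct {
	Category      string
	Score         float64
	MaxScore      float64
	Justification string
}

// CategoryResult is a local copy of the documented v0.4.0 type.
type CategoryResult struct {
	Category  string
	Score     ScoreValue
	Reasoning string
}

// migrate converts one legacy score (hypothetical helper).
// Justification carries over to Reasoning; MaxScore is dropped.
func migrate(old legacyCategoryScore) CategoryResult {
	score := ScoreFail
	if old.MaxScore > 0 {
		switch {
		case old.Score >= 0.8*old.MaxScore:
			score = ScorePass
		case old.Score >= 0.5*old.MaxScore:
			score = ScorePartial
		}
	}
	return CategoryResult{
		Category:  old.Category,
		Score:     score,
		Reasoning: old.Justification,
	}
}

func main() {
	r := migrate(legacyCategoryScore{
		Category:      "user_stories",
		Score:         30,
		MaxScore:      50,
		Justification: "Stories present but 2 of 5 lack acceptance criteria",
	})
	fmt.Println(r.Category, r.Score) // user_stories partial
}
```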
Next Steps¶
- Pass Criteria - Configure decision thresholds
- Rubrics - Define scoring criteria
- Multi-Judge - Aggregate multiple evaluations