v0.5.0
Release Date: 2026-05-23
Highlights
- Likert Scale Support: Add 1-5 numeric scales for human comparison studies
- Inter-Rater Reliability: Metrics for comparing LLM and human ratings
Overview
v0.5.0 adds dual-scale support, allowing evaluations to capture both categorical (pass/partial/fail) and numeric (1-5 Likert) scores. This enables human comparison studies, inter-rater reliability analysis, and LLM calibration workflows.
New Features
Likert Scales
Create categories with 1-5 numeric scales:
cat := evaluation.NewCategory("quality", "Content Quality", "Overall quality").
	WithLikert5(evaluation.StandardLikert5Anchors())
Standard anchors:
- 5: Excellent
- 4: Good
- 3: Adequate
- 2: Needs Improvement
- 1: Poor
Likert scores are automatically mapped to categorical scores for decisions: 4-5 maps to pass, 3 to partial, and 1-2 to fail.
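The mapping above can be sketched as a small standalone function. This is an illustration of the stated thresholds only, not the library's implementation: the `Score` type, its constants, and `LikertToCategorical` are hypothetical names, and rejecting values outside 1-5 is an assumption.

```go
package main

import "fmt"

// Score is a hypothetical stand-in for the library's categorical score type.
type Score string

const (
	ScorePass    Score = "pass"
	ScorePartial Score = "partial"
	ScoreFail    Score = "fail"
)

// LikertToCategorical applies the thresholds stated in these notes:
// 4-5 is a pass, 3 is partial, and 1-2 is a fail.
// Rejecting out-of-range values is an assumption of this sketch.
func LikertToCategorical(likert int) (Score, error) {
	switch {
	case likert < 1 || likert > 5:
		return "", fmt.Errorf("likert score %d is outside 1-5", likert)
	case likert >= 4:
		return ScorePass, nil
	case likert == 3:
		return ScorePartial, nil
	default:
		return ScoreFail, nil
	}
}

func main() {
	for likert := 1; likert <= 5; likert++ {
		score, err := LikertToCategorical(likert)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%d -> %s\n", likert, score)
	}
}
```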
Dual Scores
Record both categorical and numeric scores:
// From Likert score
result := evaluation.NewCategoryResultFromLikert("quality", 4, config, "reasoning")
// Explicit dual scores
result := evaluation.NewCategoryResultWithNumeric("quality", evaluation.ScorePass, 4.5, "reasoning")
// Add numeric to existing
result.SetNumericScore(4.5)
Inter-Rater Reliability
Compare ratings between evaluators:
metrics := evaluation.ComputeIRRFromResults(humanResults, llmResults)
fmt.Printf("Exact Agreement: %.1f%%\n", metrics.ExactAgreement*100)
fmt.Printf("Pearson r: %.3f\n", metrics.PearsonCorrelation)
Available metrics:
- Exact agreement percentage
- Adjacent agreement (within ±1)
- Mean absolute difference
- Pearson correlation
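To make the four metrics concrete, here is a minimal sketch that computes them from two equal-length slices of numeric ratings. It uses the textbook definitions and is not the library's code: `IRRMetrics` (as declared here), `ComputeIRR`, and the field names other than `ExactAgreement` and `PearsonCorrelation` are hypothetical, and returning NaN for a rater with zero variance is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// IRRMetrics is a hypothetical result type; the library's own struct may differ.
type IRRMetrics struct {
	ExactAgreement     float64 // fraction of items with identical scores
	AdjacentAgreement  float64 // fraction of items whose scores differ by at most 1
	MeanAbsDifference  float64 // mean of |a - b| across items
	PearsonCorrelation float64 // NaN when either rater has zero variance
}

// ComputeIRR compares two raters' scores item by item.
func ComputeIRR(a, b []float64) (IRRMetrics, error) {
	if len(a) == 0 || len(a) != len(b) {
		return IRRMetrics{}, errors.New("need two non-empty rating slices of equal length")
	}
	n := float64(len(a))

	var exact, adjacent, absSum, sumA, sumB float64
	for i := range a {
		d := math.Abs(a[i] - b[i])
		if d == 0 {
			exact++
		}
		if d <= 1 {
			adjacent++
		}
		absSum += d
		sumA += a[i]
		sumB += b[i]
	}

	// Pearson r = cov(a, b) / (std(a) * std(b)), computed from centered sums.
	meanA, meanB := sumA/n, sumB/n
	var cov, varA, varB float64
	for i := range a {
		da, db := a[i]-meanA, b[i]-meanB
		cov += da * db
		varA += da * da
		varB += db * db
	}
	pearson := math.NaN()
	if varA > 0 && varB > 0 {
		pearson = cov / math.Sqrt(varA*varB)
	}

	return IRRMetrics{
		ExactAgreement:     exact / n,
		AdjacentAgreement:  adjacent / n,
		MeanAbsDifference:  absSum / n,
		PearsonCorrelation: pearson,
	}, nil
}

func main() {
	human := []float64{5, 4, 3, 2, 4}
	llm := []float64{5, 3, 3, 1, 5}

	m, err := ComputeIRR(human, llm)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Exact Agreement: %.1f%%\n", m.ExactAgreement*100)
	fmt.Printf("Adjacent Agreement: %.1f%%\n", m.AdjacentAgreement*100)
	fmt.Printf("Mean Absolute Difference: %.2f\n", m.MeanAbsDifference)
	fmt.Printf("Pearson r: %.3f\n", m.PearsonCorrelation)
}
```

Exact and adjacent agreement treat the scale as ordinal steps, while Pearson correlation measures whether the two raters rank items similarly even when one is consistently stricter, so the metrics are best read together.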
Categorical agreement with confusion matrix:
agreement := evaluation.ComputeCategoricalAgreement(humanResults, llmResults)
// Analyze disagreement patterns in agreement.ConfusionMatrix
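As a rough illustration of what a categorical confusion matrix holds, the sketch below counts (human, LLM) score pairs and reports the fraction that match. The `Score` type, `ConfusionMatrix`, and `BuildConfusionMatrix` are hypothetical names for this example; the shape of the library's `agreement.ConfusionMatrix` is not specified in these notes and may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Score is a hypothetical stand-in for the library's categorical score type.
type Score string

const (
	ScorePass    Score = "pass"
	ScorePartial Score = "partial"
	ScoreFail    Score = "fail"
)

// ConfusionMatrix counts paired ratings as matrix[human][llm].
type ConfusionMatrix map[Score]map[Score]int

// BuildConfusionMatrix tallies each (human, llm) pair and returns the
// matrix together with the fraction of items where both raters agree.
func BuildConfusionMatrix(human, llm []Score) (ConfusionMatrix, float64, error) {
	if len(human) == 0 || len(human) != len(llm) {
		return nil, 0, errors.New("need two non-empty score slices of equal length")
	}
	matrix := ConfusionMatrix{}
	agree := 0
	for i := range human {
		if matrix[human[i]] == nil {
			matrix[human[i]] = map[Score]int{}
		}
		matrix[human[i]][llm[i]]++
		if human[i] == llm[i] {
			agree++
		}
	}
	return matrix, float64(agree) / float64(len(human)), nil
}

func main() {
	human := []Score{ScorePass, ScorePass, ScorePartial, ScoreFail}
	llm := []Score{ScorePass, ScorePartial, ScorePartial, ScorePass}

	matrix, agreement, err := BuildConfusionMatrix(human, llm)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Agreement: %.1f%%\n", agreement*100)

	// Print rows in a fixed order; off-diagonal cells are the disagreements.
	labels := []Score{ScorePass, ScorePartial, ScoreFail}
	for _, h := range labels {
		for _, l := range labels {
			fmt.Printf("human=%s llm=%s count=%d\n", h, l, matrix[h][l])
		}
	}
}
```

Large counts in a single off-diagonal cell, such as human=fail with llm=pass, point to a systematic disagreement rather than random noise.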
Migration Guide
v0.5.0 is fully backward compatible with v0.4.0. No migration is required.
New features are additive:
- NumericScore field is optional on CategoryResult
- The Likert scale type is a new option alongside the existing categorical, binary, and checklist types
- IRR functions are new utilities
Documentation
- Likert Scales - Creating and using Likert scales
- Inter-Rater Reliability - Computing IRR metrics
Full Changelog
See CHANGELOG.md for the complete list of changes.