v0.5.0¶

Release Date: 2026-05-23

Highlights¶

Likert Scale Support: Add 1-5 numeric scales for human comparison studies
Inter-Rater Reliability: Metrics for comparing LLM and human ratings

Overview¶

v0.5.0 adds dual-scale support, allowing evaluations to capture both categorical (pass/partial/fail) and numeric (1-5 Likert) scores. This enables human comparison studies, inter-rater reliability analysis, and LLM calibration workflows.

New Features¶

Likert Scales¶

Create categories with 1-5 numeric scales:

cat := evaluation.NewCategory("quality", "Content Quality", "Overall quality").
    WithLikert5(evaluation.StandardLikert5Anchors())

Standard anchors:

- 5: Excellent
- 4: Good
- 3: Adequate
- 2: Needs Improvement
- 1: Poor

Likert scores are automatically mapped to categorical scores for pass/fail decisions (4-5 = pass, 3 = partial, 1-2 = fail).
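The mapping above can be sketched in a few lines of standalone Go. This is an illustration of the documented thresholds only, not the library's implementation: the `Score` type, its constants, and `likertToCategorical` are hypothetical names, and returning an error for values outside 1-5 is an assumption, since the release notes do not say how out-of-range scores are handled.

```go
package main

import "fmt"

// Score is a hypothetical stand-in for the library's categorical score type.
type Score string

const (
	ScorePass    Score = "pass"
	ScorePartial Score = "partial"
	ScoreFail    Score = "fail"
)

// likertToCategorical applies the documented thresholds:
// 4-5 = pass, 3 = partial, 1-2 = fail.
// Rejecting values outside 1-5 is an assumption made for this sketch.
func likertToCategorical(likert int) (Score, error) {
	switch {
	case likert < 1 || likert > 5:
		return "", fmt.Errorf("likert score %d is outside the 1-5 range", likert)
	case likert >= 4:
		return ScorePass, nil
	case likert == 3:
		return ScorePartial, nil
	default:
		return ScoreFail, nil
	}
}

func main() {
	for likert := 1; likert <= 5; likert++ {
		score, err := likertToCategorical(likert)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%d -> %s\n", likert, score)
	}
}
```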

Dual Scores¶

Record both categorical and numeric scores:

// From a Likert score
likertResult := evaluation.NewCategoryResultFromLikert("quality", 4, config, "reasoning")

// Explicit dual scores
result := evaluation.NewCategoryResultWithNumeric("quality", evaluation.ScorePass, 4.5, "reasoning")

// Add a numeric score to an existing result
result.SetNumericScore(4.5)

Inter-Rater Reliability¶

Compare ratings between evaluators:

metrics := evaluation.ComputeIRRFromResults(humanResults, llmResults)

fmt.Printf("Exact Agreement: %.1f%%\n", metrics.ExactAgreement*100)
fmt.Printf("Pearson r: %.3f\n", metrics.PearsonCorrelation)

Available metrics:

- Exact agreement percentage
- Adjacent agreement (within ±1)
- Mean absolute difference
- Pearson correlation
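For readers who want to see what these four metrics measure, here is a minimal standalone sketch that computes them from two equal-length slices of numeric ratings. It is not the library's code: `irrMetrics` and `computeIRR` are hypothetical names, and two choices are assumptions of this sketch, namely that adjacent agreement also counts exact matches and that Pearson correlation is NaN when either rater gives the same score to every item.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// irrMetrics is a hypothetical container for the four metrics listed above.
type irrMetrics struct {
	ExactAgreement    float64 // fraction of items with identical ratings
	AdjacentAgreement float64 // fraction of items whose ratings differ by at most 1
	MeanAbsDiff       float64 // mean of |a - b| across items
	Pearson           float64 // NaN if either rater has zero variance
}

// computeIRR compares paired ratings from two raters.
func computeIRR(a, b []float64) (irrMetrics, error) {
	if len(a) == 0 || len(a) != len(b) {
		return irrMetrics{}, errors.New("rating slices must be non-empty and of equal length")
	}
	n := float64(len(a))

	var exact, adjacent, absSum, sumA, sumB float64
	for i := range a {
		d := math.Abs(a[i] - b[i])
		if d == 0 {
			exact++
		}
		if d <= 1 {
			adjacent++
		}
		absSum += d
		sumA += a[i]
		sumB += b[i]
	}

	// Pearson r = cov(a, b) / (std(a) * std(b)); the 1/n factors cancel.
	meanA, meanB := sumA/n, sumB/n
	var cov, varA, varB float64
	for i := range a {
		da, db := a[i]-meanA, b[i]-meanB
		cov += da * db
		varA += da * da
		varB += db * db
	}

	return irrMetrics{
		ExactAgreement:    exact / n,
		AdjacentAgreement: adjacent / n,
		MeanAbsDiff:       absSum / n,
		Pearson:           cov / math.Sqrt(varA*varB),
	}, nil
}

func main() {
	human := []float64{5, 4, 3, 2, 1}
	llm := []float64{5, 3, 3, 1, 2}

	m, err := computeIRR(human, llm)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Exact Agreement: %.1f%%\n", m.ExactAgreement*100)
	fmt.Printf("Adjacent Agreement: %.1f%%\n", m.AdjacentAgreement*100)
	fmt.Printf("Mean Absolute Difference: %.2f\n", m.MeanAbsDiff)
	fmt.Printf("Pearson r: %.3f\n", m.Pearson)
}
```

High adjacent agreement combined with low exact agreement usually means the two raters rank items similarly but one is consistently stricter, which is the kind of pattern a calibration workflow would look for.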

Categorical agreement with confusion matrix:

agreement := evaluation.ComputeCategoricalAgreement(humanResults, llmResults)
// Analyze disagreement patterns in agreement.ConfusionMatrix
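As a rough picture of what a confusion matrix over categorical scores contains, the sketch below counts how often each human label is paired with each LLM label. The names `Score`, `confusionMatrix`, and `agreementRate` are hypothetical, and the nested-map layout is an assumption made for illustration; the actual shape of `agreement.ConfusionMatrix` is not described in these notes.

```go
package main

import "fmt"

// Score is a hypothetical stand-in for the library's categorical score type.
type Score string

const (
	ScorePass    Score = "pass"
	ScorePartial Score = "partial"
	ScoreFail    Score = "fail"
)

// confusionMatrix counts label pairs; matrix[h][l] is the number of items
// the human rated h and the LLM rated l.
func confusionMatrix(human, llm []Score) (map[Score]map[Score]int, error) {
	if len(human) == 0 || len(human) != len(llm) {
		return nil, fmt.Errorf("need non-empty, equal-length inputs (got %d and %d)", len(human), len(llm))
	}
	matrix := make(map[Score]map[Score]int)
	for i := range human {
		if matrix[human[i]] == nil {
			matrix[human[i]] = make(map[Score]int)
		}
		matrix[human[i]][llm[i]]++
	}
	return matrix, nil
}

// agreementRate is the share of items on the matrix diagonal.
func agreementRate(matrix map[Score]map[Score]int) float64 {
	var same, total int
	for h, row := range matrix {
		for l, count := range row {
			total += count
			if h == l {
				same += count
			}
		}
	}
	if total == 0 {
		return 0
	}
	return float64(same) / float64(total)
}

func main() {
	human := []Score{ScorePass, ScorePass, ScorePartial, ScoreFail}
	llm := []Score{ScorePass, ScorePartial, ScorePartial, ScorePass}

	matrix, err := confusionMatrix(human, llm)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Off-diagonal cells show where the raters disagree, e.g. human=fail, llm=pass.
	fmt.Println("human=fail, llm=pass:", matrix[ScoreFail][ScorePass])
	fmt.Printf("agreement: %.2f\n", agreementRate(matrix))
}
```

Off-diagonal cells are the interesting ones: a large count in the human=fail, llm=pass cell points to a lenient LLM judge, while the reverse cell points to an overly strict one.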

Migration Guide¶

v0.5.0 is fully backward compatible with v0.4.0. No migration required.

New features are additive:

- NumericScore field is optional on CategoryResult
- Likert scale type is a new option alongside existing categorical/binary/checklist
- IRR functions are new utilities

Documentation¶

Likert Scales - Creating and using Likert scales
Inter-Rater Reliability - Computing IRR metrics

Full Changelog¶

See CHANGELOG.md for the complete list of changes.