Inter-Rater Reliability (IRR)¶
Structured-evaluation provides inter-rater reliability metrics that quantify how closely two evaluators agree on the same items, for example an LLM against a human, or one human against another.
Use Cases¶
- LLM Calibration: Compare LLM ratings with human ground truth
- Quality Assurance: Verify LLM evaluations align with expert judgment
- Research: Measure agreement in evaluation studies
- Continuous Improvement: Track LLM alignment over time
Available Metrics¶
IRRMetrics¶
type IRRMetrics struct {
    ExactAgreement         float64 // fraction of pairs with identical ratings (0 to 1)
    AdjacentAgreement      float64 // fraction of pairs within ±1 point (0 to 1)
    MeanAbsoluteDifference float64 // average |Rater1 - Rater2|
    PearsonCorrelation     float64 // -1 to 1
    SampleSize             int     // number of rating pairs
}
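To make the field definitions concrete, here is a minimal sketch of how such metrics can be derived from paired ratings. It is not the library's implementation: the `pair` and `summary` types and the `computeSummary` function are hypothetical stand-ins, and the sketch assumes that adjacent agreement also counts exact matches and that the correlation is reported as 0 when either rater's scores have no variance.

```go
package main

import (
	"fmt"
	"math"
)

// pair is a hypothetical stand-in for a rating pair (not the library type).
type pair struct {
	Rater1, Rater2 float64
}

// summary mirrors the IRRMetrics fields; agreement values are fractions in [0, 1].
type summary struct {
	ExactAgreement         float64
	AdjacentAgreement      float64
	MeanAbsoluteDifference float64
	PearsonCorrelation     float64
	SampleSize             int
}

// computeSummary derives all metrics from the paired ratings in two passes.
func computeSummary(pairs []pair) summary {
	n := float64(len(pairs))
	if n == 0 {
		return summary{}
	}
	var exact, adjacent, absSum, sum1, sum2 float64
	for _, p := range pairs {
		d := math.Abs(p.Rater1 - p.Rater2)
		if d == 0 {
			exact++
		}
		if d <= 1 { // assumption: exact matches also count as adjacent
			adjacent++
		}
		absSum += d
		sum1 += p.Rater1
		sum2 += p.Rater2
	}
	mean1, mean2 := sum1/n, sum2/n

	// Second pass: covariance and variances for Pearson's r.
	var cov, var1, var2 float64
	for _, p := range pairs {
		d1, d2 := p.Rater1-mean1, p.Rater2-mean2
		cov += d1 * d2
		var1 += d1 * d1
		var2 += d2 * d2
	}
	r := 0.0 // assumption: report 0 when a rater has zero variance
	if var1 > 0 && var2 > 0 {
		r = cov / math.Sqrt(var1*var2)
	}
	return summary{
		ExactAgreement:         exact / n,
		AdjacentAgreement:      adjacent / n,
		MeanAbsoluteDifference: absSum / n,
		PearsonCorrelation:     r,
		SampleSize:             len(pairs),
	}
}

func main() {
	s := computeSummary([]pair{{5, 4}, {3, 3}, {4, 5}})
	fmt.Printf("exact=%.3f adjacent=%.3f mad=%.3f r=%.3f n=%d\n",
		s.ExactAgreement, s.AdjacentAgreement, s.MeanAbsoluteDifference,
		s.PearsonCorrelation, s.SampleSize)
}
```

With these three pairs, one of three ratings matches exactly, all three are within one point, and the correlation is only 0.5. High adjacent agreement can therefore coexist with modest correlation on a small sample.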
CategoricalAgreement¶
type CategoricalAgreement struct {
    ExactAgreement  float64        // fraction of exact categorical matches (0 to 1)
    ConfusionMatrix map[string]int // counts keyed by score pair, e.g. "pass:pass"
    SampleSize      int            // number of compared results
}
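The following sketch illustrates how a confusion matrix with colon-separated keys can be tallied from two label sequences. It is an illustration only: `categoricalSummary` and `tally` are hypothetical names, the labels are plain strings, and pairing results by position is an assumption about how the two inputs line up.

```go
package main

import (
	"fmt"
	"sort"
)

// categoricalSummary mirrors the CategoricalAgreement fields (hypothetical type).
type categoricalSummary struct {
	ExactAgreement  float64
	ConfusionMatrix map[string]int
	SampleSize      int
}

// tally pairs the labels by position (an assumption) and counts each
// "first:second" combination, including the agreeing ones.
func tally(first, second []string) categoricalSummary {
	n := len(first)
	if len(second) < n {
		n = len(second)
	}
	out := categoricalSummary{ConfusionMatrix: map[string]int{}, SampleSize: n}
	matches := 0
	for i := 0; i < n; i++ {
		out.ConfusionMatrix[first[i]+":"+second[i]]++
		if first[i] == second[i] {
			matches++
		}
	}
	if n > 0 {
		out.ExactAgreement = float64(matches) / float64(n)
	}
	return out
}

func main() {
	human := []string{"pass", "partial", "fail", "pass"}
	llm := []string{"pass", "partial", "partial", "fail"}
	res := tally(human, llm)

	// Sort the keys so the report is stable; map iteration order is not.
	keys := make([]string, 0, len(res.ConfusionMatrix))
	for k := range res.ConfusionMatrix {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %d\n", k, res.ConfusionMatrix[k])
	}
	fmt.Printf("exact agreement: %.2f\n", res.ExactAgreement)
}
```

Keeping the agreeing pairs in the same map means the counts sum to the sample size, which makes it easy to turn any pattern into a share of all comparisons.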
Computing IRR¶
From Rating Pairs¶
pairs := []evaluation.RatingPair{
    {Rater1: 5, Rater2: 4, Category: "quality", ItemID: "doc1"},
    {Rater1: 3, Rater2: 3, Category: "quality", ItemID: "doc2"},
    {Rater1: 4, Rater2: 5, Category: "quality", ItemID: "doc3"},
}
metrics := evaluation.ComputeIRR(pairs)
fmt.Printf("Exact Agreement: %.1f%%\n", metrics.ExactAgreement*100)
fmt.Printf("Adjacent Agreement: %.1f%%\n", metrics.AdjacentAgreement*100)
fmt.Printf("Mean Absolute Diff: %.2f\n", metrics.MeanAbsoluteDifference)
fmt.Printf("Pearson r: %.3f\n", metrics.PearsonCorrelation)
From Category Results¶
Compare two sets of evaluation results directly:
// Human ratings
humanResults := []evaluation.CategoryResult{
    *evaluation.NewCategoryResultWithNumeric("quality", evaluation.ScorePass, 5.0, ""),
    *evaluation.NewCategoryResultWithNumeric("clarity", evaluation.ScorePartial, 3.0, ""),
}
// LLM ratings
llmResults := []evaluation.CategoryResult{
    *evaluation.NewCategoryResultWithNumeric("quality", evaluation.ScorePass, 4.0, ""),
    *evaluation.NewCategoryResultWithNumeric("clarity", evaluation.ScorePartial, 3.0, ""),
}
metrics := evaluation.ComputeIRRFromResults(humanResults, llmResults)
Categorical Agreement¶
For pass/partial/fail comparisons:
agreement := evaluation.ComputeCategoricalAgreement(humanResults, llmResults)
fmt.Printf("Exact Agreement: %.1f%%\n", agreement.ExactAgreement*100)
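// Inspect the confusion matrix (it holds agreements as well as disagreements)
for pattern, count := range agreement.ConfusionMatrix {
    fmt.Printf(" %s: %d\n", pattern, count)
}
// Prints one line per pattern, here "pass:pass: 1" and "partial:partial: 1".
// Go map iteration order is not fixed, so the line order may vary between runs.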
Interpreting Metrics¶
Exact Agreement¶
| Value | Interpretation |
|---|---|
| > 80% | Excellent agreement |
| 60-80% | Good agreement |
| 40-60% | Moderate agreement |
| < 40% | Poor agreement |
Adjacent Agreement¶
| Value | Interpretation |
|---|---|
| > 90% | Excellent (ratings within 1 point) |
| 70-90% | Acceptable |
| < 70% | Needs calibration |
Pearson Correlation¶
| Value | Interpretation |
|---|---|
| > 0.8 | Strong positive correlation |
| 0.5-0.8 | Moderate correlation |
| 0.3-0.5 | Weak correlation |
| < 0.3 | Little to no correlation |
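The three interpretation tables above can be encoded as small helper functions, which is convenient when a report should print a label next to each number. This is a sketch with hypothetical function names, not part of the library. It takes agreement values as fractions (0 to 1) and assumes that a value exactly on a boundary such as 80% falls into the lower band, since the tables leave that case open.

```go
package main

import "fmt"

// interpretExact labels an exact-agreement fraction (0 to 1).
func interpretExact(v float64) string {
	switch {
	case v > 0.8:
		return "excellent"
	case v >= 0.6:
		return "good"
	case v >= 0.4:
		return "moderate"
	default:
		return "poor"
	}
}

// interpretAdjacent labels an adjacent-agreement fraction (0 to 1).
func interpretAdjacent(v float64) string {
	switch {
	case v > 0.9:
		return "excellent"
	case v >= 0.7:
		return "acceptable"
	default:
		return "needs calibration"
	}
}

// interpretPearson labels a correlation coefficient (-1 to 1).
// Negative values fall into the lowest band, as in the table.
func interpretPearson(r float64) string {
	switch {
	case r > 0.8:
		return "strong"
	case r >= 0.5:
		return "moderate"
	case r >= 0.3:
		return "weak"
	default:
		return "little to none"
	}
}

func main() {
	fmt.Println("exact 0.72:", interpretExact(0.72))
	fmt.Println("adjacent 0.95:", interpretAdjacent(0.95))
	fmt.Println("pearson 0.41:", interpretPearson(0.41))
}
```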
Automatic Score Conversion¶
When results carry only a categorical score, that score is converted to a numeric value before IRR is computed:
| Categorical | Numeric |
|---|---|
| Pass | 5.0 |
| Partial | 3.0 |
| Fail | 1.0 |
This allows IRR computation even without explicit numeric scores.
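As an illustration of the mapping in the table, the sketch below converts categorical labels to numbers and compares two raters on the resulting scale. The `toNumeric` helper and the lowercase string labels are assumptions made for this example. The library performs its own conversion internally.

```go
package main

import "fmt"

// toNumeric maps a categorical label to the numeric value from the table.
// The second return value reports whether the label was recognized.
func toNumeric(label string) (float64, bool) {
	switch label {
	case "pass":
		return 5.0, true
	case "partial":
		return 3.0, true
	case "fail":
		return 1.0, true
	}
	return 0, false
}

func main() {
	human := []string{"pass", "partial"}
	llm := []string{"pass", "fail"}

	for i := range human {
		h, okH := toNumeric(human[i])
		l, okL := toNumeric(llm[i])
		if !okH || !okL {
			fmt.Printf("item %d: unknown label, skipped\n", i)
			continue
		}
		fmt.Printf("item %d: human=%.1f llm=%.1f diff=%.1f\n", i, h, l, h-l)
	}
}
```

Because neighboring categories are two points apart on this scale, a pass/partial disagreement is not within ±1 and so would not count toward adjacent agreement under this mapping.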
Example: LLM Calibration Workflow¶
// 1. Collect human ground truth
humanResults := runHumanEvaluation(documents)
// 2. Run LLM evaluation
llmResults := runLLMEvaluation(documents)
// 3. Compute IRR
metrics := evaluation.ComputeIRRFromResults(humanResults, llmResults)
catAgreement := evaluation.ComputeCategoricalAgreement(humanResults, llmResults)
// 4. Report
fmt.Println("=== LLM Calibration Report ===")
fmt.Printf("Sample Size: %d evaluations\n", metrics.SampleSize)
fmt.Printf("Exact Agreement: %.1f%%\n", metrics.ExactAgreement*100)
fmt.Printf("Adjacent Agreement: %.1f%%\n", metrics.AdjacentAgreement*100)
fmt.Printf("Correlation: %.3f\n", metrics.PearsonCorrelation)
fmt.Printf("Categorical Agreement: %.1f%%\n", catAgreement.ExactAgreement*100)
// 5. Identify systematic biases: report only patterns where the two scores differ
for pattern, count := range catAgreement.ConfusionMatrix {
    // strings.Cut (Go 1.18+) splits the key at the first ":"
    first, second, ok := strings.Cut(pattern, ":")
    if ok && first != second {
        fmt.Printf("Disagreement %s: %d cases\n", pattern, count)
    }
}
Best Practices¶
- Collect sufficient samples: aim for 30+ pairs for reliable metrics
- Use numeric scores when possible: they are more granular than categorical scores
- Track over time: monitor calibration drift
- Analyze confusion patterns: identify systematic biases (e.g., the LLM is consistently harsher)
- Iterate on prompts: use IRR feedback to improve LLM evaluation prompts
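One way to check for a directional bias such as "the LLM is consistently harsher" is to look at the mean signed difference, which the absolute-difference metric hides. The sketch below is an assumption-laden illustration: `meanSignedDifference` is a hypothetical helper, not a library function, and it expects two equally long slices of numeric scores for the same items.

```go
package main

import "fmt"

// meanSignedDifference returns the average of (llm - human).
// A negative result means the LLM scores lower (harsher) on average,
// a positive result means it scores higher (more lenient).
func meanSignedDifference(human, llm []float64) float64 {
	n := len(human)
	if len(llm) < n {
		n = len(llm)
	}
	if n == 0 {
		return 0
	}
	sum := 0.0
	for i := 0; i < n; i++ {
		sum += llm[i] - human[i]
	}
	return sum / float64(n)
}

func main() {
	human := []float64{5, 4, 3, 4}
	llm := []float64{4, 3, 3, 3}

	bias := meanSignedDifference(human, llm)
	fmt.Printf("mean signed difference: %.2f\n", bias)
	if bias < 0 {
		fmt.Println("the LLM is harsher than the human rater on average")
	}
}
```

Tracking this value alongside the IRR metrics over time shows whether prompt changes move the LLM toward or away from the human baseline.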