Multi-Judge Aggregation

Combine evaluations from multiple judges (LLMs or humans) to improve reliability and reduce individual bias.

Overview

type MultiJudgeResult struct {
    Evaluations          []EvaluationReport `json:"evaluations"`
    AggregationMethod    AggregationMethod  `json:"aggregation_method"`
    Agreement            float64            `json:"agreement"` // 0-1
    Disagreements        []Disagreement     `json:"disagreements,omitempty"`
    ConsolidatedReport   EvaluationReport   `json:"consolidated_report"`
    ConsolidatedDecision Decision           `json:"consolidated_decision"`
}

Aggregation Methods

const (
    AggregationMajority     AggregationMethod = "majority"     // Most common score wins
    AggregationConservative AggregationMethod = "conservative" // Lowest score wins
    AggregationOptimistic   AggregationMethod = "optimistic"   // Highest score wins
)

Majority Voting

Uses the most common score across judges:

// 3 judges: pass, pass, partial → pass
// 3 judges: pass, partial, fail → no clear majority, use tie-breaker
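
The vote counting described above can be sketched as follows. This is an illustration only, not the library's implementation: the `Score` type and `mostCommon` function are hypothetical names, and the sketch assumes that a shared top count is what "no clear majority" means. It reports the tie instead of picking a tie-breaker, since the tie-breaking rule is not specified here.

```go
package main

import "fmt"

// Score is a hypothetical stand-in for the library's score type.
type Score string

// mostCommon returns the score given by the largest number of judges.
// ok is false when the input is empty or the top count is shared,
// which is where a tie-breaker would take over.
func mostCommon(scores []Score) (winner Score, ok bool) {
	counts := make(map[Score]int)
	for _, s := range scores {
		counts[s]++
	}
	best, tied := 0, false
	for s, n := range counts {
		switch {
		case n > best:
			winner, best, tied = s, n, false
		case n == best:
			tied = true
		}
	}
	if best == 0 || tied {
		return "", false
	}
	return winner, true
}

func main() {
	for _, panel := range [][]Score{
		{"pass", "pass", "partial"},
		{"pass", "partial", "fail"},
	} {
		winner, ok := mostCommon(panel)
		fmt.Printf("%v -> winner=%q clear=%t\n", panel, winner, ok)
	}
}
```

Returning a separate `ok` flag keeps the tie-breaking policy with the caller, who might fall back to the conservative method or request another judge.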

Conservative

Takes the lowest (most pessimistic) score:

// 3 judges: pass, pass, partial → partial
// Useful for security-critical evaluations

Optimistic

Takes the highest score:

// 3 judges: pass, partial, partial → pass
// Useful when any judge passing is sufficient
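
Both the conservative and optimistic methods reduce to picking an extreme once scores are ordered. The sketch below assumes the ordering fail < partial < pass and uses plain strings. The `rank` table and the `lowest` and `highest` helpers are hypothetical and not part of the documented API.

```go
package main

import "fmt"

// rank is an assumed ordering of score labels from worst to best.
// Labels missing from the table rank as 0, the same as "fail".
var rank = map[string]int{"fail": 0, "partial": 1, "pass": 2}

// lowest returns the worst-ranked score (conservative aggregation).
// scores must be non-empty.
func lowest(scores []string) string {
	out := scores[0]
	for _, s := range scores[1:] {
		if rank[s] < rank[out] {
			out = s
		}
	}
	return out
}

// highest returns the best-ranked score (optimistic aggregation).
// scores must be non-empty.
func highest(scores []string) string {
	out := scores[0]
	for _, s := range scores[1:] {
		if rank[s] > rank[out] {
			out = s
		}
	}
	return out
}

func main() {
	fmt.Println(lowest([]string{"pass", "pass", "partial"}))     // partial
	fmt.Println(highest([]string{"pass", "partial", "partial"})) // pass
}
```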

Aggregating Evaluations

evaluations := []evaluation.EvaluationReport{
    judge1Report,
    judge2Report,
    judge3Report,
}

result := evaluation.AggregateEvaluations(
    evaluations,
    evaluation.AggregationMajority,
)

// Access results
fmt.Printf("Agreement: %.1f%%\n", result.Agreement*100)
fmt.Printf("Decision: %s\n", result.ConsolidatedDecision.Status)

// Check disagreements
for _, d := range result.Disagreements {
    fmt.Printf("Disagreement on %s: %v\n", d.Category, d.Scores)
}

Agreement Metrics

Agreement measures how consistently judges assign the same score, both overall and per category:

type Agreement struct {
    Overall    float64            `json:"overall"`     // 0-1, across all categories
    ByCategory map[string]float64 `json:"by_category"` // Per-category agreement
}

Computing Agreement

// Perfect agreement: all judges give same score → 1.0
// No agreement: all different scores → 0.0
// Partial: some agree → 0.33-0.66

// Fleiss' Kappa is used for >2 judges
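
For reference, a minimal sketch of the textbook Fleiss' kappa calculation follows. It is not taken from the library: `observedAgreement` and `fleissKappa` are hypothetical names, and scores are plain strings. Note that raw kappa can be negative and is undefined when every rating uses the same label, so an implementation that reports a 0-1 value presumably normalizes or clamps it; how the library does so is not stated here.

```go
package main

import "fmt"

// observedAgreement returns the fraction of judge pairs that gave the same
// score for one item (P_i in Fleiss' notation). Needs at least two scores.
func observedAgreement(scores []string) float64 {
	n := len(scores)
	counts := map[string]int{}
	for _, s := range scores {
		counts[s]++
	}
	agreeing := 0
	for _, c := range counts {
		agreeing += c * (c - 1)
	}
	return float64(agreeing) / float64(n*(n-1))
}

// fleissKappa expects ratings[i][j] = score from judge j on item i, with the
// same number of judges (at least two) on every item. ok is false when the
// input is malformed or kappa is undefined (expected agreement equals 1).
func fleissKappa(ratings [][]string) (kappa float64, ok bool) {
	if len(ratings) == 0 || len(ratings[0]) < 2 {
		return 0, false
	}
	n := len(ratings[0])
	totals := map[string]int{}
	pBar := 0.0
	for _, item := range ratings {
		if len(item) != n {
			return 0, false
		}
		pBar += observedAgreement(item)
		for _, s := range item {
			totals[s]++
		}
	}
	pBar /= float64(len(ratings))

	// Expected agreement by chance: sum of squared label proportions.
	pe := 0.0
	for _, c := range totals {
		p := float64(c) / float64(len(ratings)*n)
		pe += p * p
	}
	if pe == 1 {
		return 0, false
	}
	return (pBar - pe) / (1 - pe), true
}

func main() {
	ratings := [][]string{
		{"pass", "pass", "pass"},
		{"pass", "partial", "fail"},
		{"fail", "fail", "partial"},
	}
	for _, item := range ratings {
		fmt.Printf("%v -> %.2f\n", item, observedAgreement(item))
	}
	if k, ok := fleissKappa(ratings); ok {
		fmt.Printf("kappa = %.3f\n", k)
	}
}
```

With three judges, the per-item pairwise agreement is 1 when all agree, 1/3 when exactly two agree, and 0 when all differ.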

Disagreements

When judges disagree significantly on a category, the result records the details:

type Disagreement struct {
    Category string       `json:"category"`
    Scores   []ScoreValue `json:"scores"`   // What each judge scored
    Spread   float64      `json:"spread"`   // How much they disagree
    Resolved ScoreValue   `json:"resolved"` // Final aggregated score
}

Handling Disagreements

for _, d := range result.Disagreements {
    if d.Spread > 0.5 {
        // Major disagreement - may need human review
        fmt.Printf("⚠️ Major disagreement on %s\n", d.Category)
    }
}
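
When many categories are flagged, it can help to hand reviewers a list ordered by severity of disagreement. The sketch below is one possible approach, not part of the library: `reviewQueue` is a hypothetical helper, the local `Disagreement` type mirrors only the two fields it needs, and the category names are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Disagreement mirrors only the fields this sketch needs.
type Disagreement struct {
	Category string
	Spread   float64
}

// reviewQueue returns the categories whose spread exceeds threshold,
// ordered from largest spread to smallest.
func reviewQueue(ds []Disagreement, threshold float64) []string {
	var flagged []Disagreement
	for _, d := range ds {
		if d.Spread > threshold {
			flagged = append(flagged, d)
		}
	}
	sort.SliceStable(flagged, func(i, j int) bool {
		return flagged[i].Spread > flagged[j].Spread
	})
	names := make([]string, 0, len(flagged))
	for _, d := range flagged {
		names = append(names, d.Category)
	}
	return names
}

func main() {
	ds := []Disagreement{
		{Category: "security", Spread: 0.6},
		{Category: "style", Spread: 0.2},
		{Category: "correctness", Spread: 0.9},
	}
	fmt.Println(reviewQueue(ds, 0.5)) // [correctness security]
}
```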

Finding Consolidation

Findings from multiple judges are merged:

// Same finding from multiple judges → deduplicated
// Different severity → highest severity used
// Different details → combined
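
The three merge rules above could be sketched like this. The `Finding` type, its fields, and `mergeFindings` are hypothetical. In particular, the sketch assumes that findings are matched by title, that severity is an integer where larger means more severe, and that details are joined with "; ". The library's actual matching logic is not described here.

```go
package main

import (
	"fmt"
	"strings"
)

// Finding is a hypothetical shape for a judge's finding.
type Finding struct {
	Title    string
	Severity int // assumed: larger means more severe
	Details  string
}

// mergeFindings combines findings from several judges. Findings with the
// same title collapse into one entry that keeps the highest severity and
// appends any details not already present (a simple substring check).
func mergeFindings(perJudge ...[]Finding) []Finding {
	index := map[string]int{} // title -> position in merged
	var merged []Finding
	for _, findings := range perJudge {
		for _, f := range findings {
			i, seen := index[f.Title]
			if !seen {
				index[f.Title] = len(merged)
				merged = append(merged, f)
				continue
			}
			if f.Severity > merged[i].Severity {
				merged[i].Severity = f.Severity
			}
			switch {
			case f.Details == "" || strings.Contains(merged[i].Details, f.Details):
				// Nothing new to add.
			case merged[i].Details == "":
				merged[i].Details = f.Details
			default:
				merged[i].Details += "; " + f.Details
			}
		}
	}
	return merged
}

func main() {
	judge1 := []Finding{{Title: "Missing input validation", Severity: 2, Details: "login form"}}
	judge2 := []Finding{
		{Title: "Missing input validation", Severity: 3, Details: "search endpoint"},
		{Title: "Outdated dependency", Severity: 1, Details: "logging library"},
	}
	for _, f := range mergeFindings(judge1, judge2) {
		fmt.Printf("%s (severity %d): %s\n", f.Title, f.Severity, f.Details)
	}
}
```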

Example: Three-Judge Panel

// Run evaluation with 3 different LLMs
judges := []JudgeConfig{
    {Model: "claude-3-opus", Provider: "anthropic"},
    {Model: "gpt-4", Provider: "openai"},
    {Model: "gemini-pro", Provider: "google"},
}

var evaluations []evaluation.EvaluationReport
for _, judge := range judges {
    report := runEvaluation(document, rubric, judge)
    evaluations = append(evaluations, report)
}

// Aggregate with majority voting
result := evaluation.AggregateEvaluations(evaluations, evaluation.AggregationMajority)

// Override the consolidated decision when judges disagree too much
if result.Agreement < 0.6 {
    // Low agreement - flag for human review
    result.ConsolidatedDecision.Status = evaluation.DecisionHumanReview
}

Best Practices

Judge Selection

  • Use diverse judges (different models, providers)
  • Consider cost/latency tradeoffs
  • Include at least 3 judges for majority voting

When to Use Multi-Judge

  • High-stakes evaluations (security, compliance)
  • Calibration of new rubrics
  • Contentious or subjective assessments
  • Building evaluation datasets

Handling Low Agreement

if result.Agreement < 0.5 {
    // Options:
    // 1. Flag for human review
    // 2. Use conservative aggregation
    // 3. Add more judges
    // 4. Refine rubric criteria
}

Cost Optimization

// Start with a single judge
report := runEvaluation(doc, rubric, primaryJudge)

// Escalate to the full panel only for borderline cases
if report.Decision.Status == evaluation.DecisionConditional {
    reports := runMultiJudge(doc, rubric, allJudges)
    result := evaluation.AggregateEvaluations(reports, evaluation.AggregationMajority)
    fmt.Printf("Decision: %s\n", result.ConsolidatedDecision.Status)
}

Next Steps