Multi-Judge Aggregation¶

Combine evaluations from multiple judges (LLMs or humans) to improve reliability and reduce individual bias.

Overview¶

type MultiJudgeResult struct {
    Evaluations          []Rubric          `json:"evaluations"`
    AggregationMethod    AggregationMethod `json:"aggregation_method"`
    Agreement            float64           `json:"agreement"` // 0-1
    Disagreements        []Disagreement    `json:"disagreements,omitempty"`
    ConsolidatedReport   Rubric            `json:"consolidated_report"`
    ConsolidatedDecision Decision          `json:"consolidated_decision"`
}

Aggregation Methods¶

const (
    AggregationMajority     AggregationMethod = "majority"     // Most common score wins
    AggregationConservative AggregationMethod = "conservative" // Lowest score wins
    AggregationOptimistic   AggregationMethod = "optimistic"   // Highest score wins
)

Majority Voting¶

Uses the most common score across judges:

// 3 judges: pass, pass, partial → pass
// 3 judges: pass, partial, fail → no clear majority, use tie-breaker
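The voting rule can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the `Score` type, the `rank` table, and the `majority` function are hypothetical names, and the tie-breaker (fall back to the lowest tied score) is an assumption, since the tie-breaking rule is not specified here.

```go
package main

import "fmt"

// Score is a hypothetical stand-in for the library's score type.
type Score string

// rank orders scores from worst to best (assumed scale for this sketch).
var rank = map[Score]int{"fail": 0, "partial": 1, "pass": 2}

// majority returns the most common score. When several scores share the
// top count, it falls back to the lowest-ranked of them. That tie-breaker
// is an assumption made for illustration only.
func majority(scores []Score) Score {
	counts := map[Score]int{}
	for _, s := range scores {
		counts[s]++
	}
	var best Score
	bestCount := 0
	for s, c := range counts {
		if c > bestCount || (c == bestCount && rank[s] < rank[best]) {
			best, bestCount = s, c
		}
	}
	return best
}

func main() {
	fmt.Println(majority([]Score{"pass", "pass", "partial"})) // clear majority
	fmt.Println(majority([]Score{"pass", "partial", "fail"})) // tie, broken conservatively
}
```

Breaking ties toward the lower score keeps the result deterministic and errs on the side of caution; a different rule (for example, deferring to a designated lead judge) would fit equally well.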

Conservative¶

Takes the lowest (most pessimistic) score:

// 3 judges: pass, pass, partial → partial
// Useful for security-critical evaluations

Optimistic¶

Takes the highest score:

// 3 judges: pass, partial, partial → pass
// Useful when any judge passing is sufficient
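Conservative and optimistic aggregation both amount to picking an extreme on an ordered score scale. The sketch below shows that idea under stated assumptions: `Score`, `rank`, and `pick` are hypothetical names, and the three-level scale is inferred from the examples above rather than taken from the library.

```go
package main

import "fmt"

// Score is a hypothetical stand-in for the library's score type.
type Score string

// rank orders scores from worst to best (assumed scale for this sketch).
var rank = map[Score]int{"fail": 0, "partial": 1, "pass": 2}

// pick returns the lowest-ranked score when conservative is true and the
// highest-ranked score otherwise. It assumes scores is non-empty.
func pick(scores []Score, conservative bool) Score {
	best := scores[0]
	for _, s := range scores[1:] {
		lower := rank[s] < rank[best]
		higher := rank[s] > rank[best]
		if (conservative && lower) || (!conservative && higher) {
			best = s
		}
	}
	return best
}

func main() {
	fmt.Println(pick([]Score{"pass", "pass", "partial"}, true))     // conservative
	fmt.Println(pick([]Score{"pass", "partial", "partial"}, false)) // optimistic
}
```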

Aggregating Evaluations¶

evaluations := []rubric.Rubric{
    judge1Report,
    judge2Report,
    judge3Report,
}

result := rubric.AggregateEvaluations(
    evaluations,
    rubric.AggregationMajority,
)

// Access results
fmt.Printf("Agreement: %.1f%%\n", result.Agreement*100)
fmt.Printf("Decision: %s\n", result.ConsolidatedDecision.Status)

// Check disagreements
for _, d := range result.Disagreements {
    fmt.Printf("Disagreement on %s: %v\n", d.Category, d.Scores)
}

Agreement Metrics¶

Agreement measures how consistently judges assign the same score, both overall and per category:

type Agreement struct {
    Overall    float64            `json:"overall"`    // 0-1, across all categories
    ByCategory map[string]float64 `json:"by_category"` // Per-category agreement
}

Computing Agreement¶

// Perfect agreement: all judges give same score → 1.0
// No agreement: all different scores → 0.0
// Partial agreement: only some judges agree → a value in between (e.g. 0.33-0.66)

// Fleiss' kappa is used when there are more than two judges
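To make the numbers concrete, here is a hedged sketch of two standard calculations: the share of agreeing judge pairs for one category, and Fleiss' kappa across categories. The function names and the use of plain strings for scores are assumptions for illustration; the library's exact formula and any rescaling to the 0-1 range are not shown here.

```go
package main

import "fmt"

// pairwiseAgreement returns the fraction of judge pairs that gave the same
// score for one category. This is the per-item term P_i in Fleiss' kappa.
func pairwiseAgreement(scores []string) float64 {
	n := len(scores)
	if n < 2 {
		return 1
	}
	counts := map[string]int{}
	for _, s := range scores {
		counts[s]++
	}
	agreeing := 0
	for _, c := range counts {
		agreeing += c * (c - 1) // ordered pairs of judges sharing this score
	}
	return float64(agreeing) / float64(n*(n-1))
}

// fleissKappa expects one row of scores per category, with the same number
// of judges in every row and at least one row.
func fleissKappa(rows [][]string) float64 {
	totals := map[string]int{}
	total := 0
	sumP := 0.0
	for _, row := range rows {
		sumP += pairwiseAgreement(row)
		for _, s := range row {
			totals[s]++
			total++
		}
	}
	pBar := sumP / float64(len(rows)) // mean observed agreement
	pe := 0.0                         // agreement expected by chance
	for _, c := range totals {
		p := float64(c) / float64(total)
		pe += p * p
	}
	if pe == 1 {
		return 1 // every judge gave the same score everywhere
	}
	return (pBar - pe) / (1 - pe)
}

func main() {
	rows := [][]string{
		{"pass", "pass", "pass"},
		{"pass", "pass", "partial"},
		{"pass", "partial", "fail"},
	}
	for _, row := range rows {
		fmt.Printf("%v -> %.2f\n", row, pairwiseAgreement(row))
	}
	fmt.Printf("kappa = %.3f\n", fleissKappa(rows))
}
```

With three judges the pairwise figure is 1.0 when all agree, about 0.33 when two agree, and 0.0 when all differ. Raw Fleiss' kappa is not confined to 0-1: for the sample rows it comes out at about -0.125, meaning agreement is slightly below what chance would predict.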

Disagreements¶

When judges disagree significantly on a category, the result records a Disagreement:

type Disagreement struct {
    Category string       `json:"category"`
    Scores   []ScoreValue `json:"scores"`     // What each judge scored
    Spread   float64      `json:"spread"`     // How much they disagree
    Resolved ScoreValue   `json:"resolved"`   // Final aggregated score
}

Handling Disagreements¶

for _, d := range result.Disagreements {
    if d.Spread > 0.5 {
        // Major disagreement - may need human review
        fmt.Printf("⚠️ Major disagreement on %s\n", d.Category)
    }
}
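When many categories are flagged, it can help to triage them by size. The following sketch filters and sorts disagreements into a review queue; the local `Disagreement` struct mirrors only the two fields it needs, and `reviewQueue` is a hypothetical helper, not part of the library.

```go
package main

import (
	"fmt"
	"sort"
)

// Disagreement mirrors only the fields this sketch needs.
type Disagreement struct {
	Category string
	Spread   float64
}

// reviewQueue returns the disagreements whose spread exceeds threshold,
// ordered from largest to smallest spread.
func reviewQueue(ds []Disagreement, threshold float64) []Disagreement {
	var out []Disagreement
	for _, d := range ds {
		if d.Spread > threshold {
			out = append(out, d)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Spread > out[j].Spread
	})
	return out
}

func main() {
	ds := []Disagreement{
		{Category: "clarity", Spread: 0.2},
		{Category: "security", Spread: 0.9},
		{Category: "accuracy", Spread: 0.6},
	}
	for _, d := range reviewQueue(ds, 0.5) {
		fmt.Printf("review %s (spread %.1f)\n", d.Category, d.Spread)
	}
}
```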

Finding Consolidation¶

Findings from multiple judges are merged:

// Same finding from multiple judges → deduplicated
// Different severity → highest severity used
// Different details → combined
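The three merge rules can be illustrated as follows. This is a sketch under assumptions: the `Finding` struct, its `ID` key for detecting duplicates, the integer severity scale, and `mergeFindings` are all hypothetical and may differ from the library's actual types.

```go
package main

import (
	"fmt"
	"strings"
)

// Finding is a hypothetical shape for a judge's finding.
type Finding struct {
	ID       string // key used to detect duplicates (assumed)
	Severity int    // higher means more severe (assumed)
	Details  string
}

// mergeFindings deduplicates findings by ID, keeps the highest severity,
// and combines details that differ between judges.
func mergeFindings(perJudge ...[]Finding) []Finding {
	index := map[string]int{} // finding ID -> position in merged
	var merged []Finding
	for _, findings := range perJudge {
		for _, f := range findings {
			i, seen := index[f.ID]
			if !seen {
				index[f.ID] = len(merged)
				merged = append(merged, f)
				continue
			}
			m := &merged[i]
			if f.Severity > m.Severity {
				m.Severity = f.Severity
			}
			switch {
			case f.Details == "" || strings.Contains(m.Details, f.Details):
				// Nothing new to add.
			case m.Details == "":
				m.Details = f.Details
			default:
				m.Details += "; " + f.Details
			}
		}
	}
	return merged
}

func main() {
	judge1 := []Finding{{ID: "sql-injection", Severity: 2, Details: "unsanitized input"}}
	judge2 := []Finding{
		{ID: "sql-injection", Severity: 3, Details: "query built by concatenation"},
		{ID: "missing-tests", Severity: 1, Details: "no unit tests"},
	}
	for _, f := range mergeFindings(judge1, judge2) {
		fmt.Printf("%s severity=%d details=%q\n", f.ID, f.Severity, f.Details)
	}
}
```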

Example: Three-Judge Panel¶

// Run evaluation with 3 different LLMs
judges := []JudgeConfig{
    {Model: "claude-3-opus", Provider: "anthropic"},
    {Model: "gpt-4", Provider: "openai"},
    {Model: "gemini-pro", Provider: "google"},
}

var evaluations []rubric.Rubric
for _, judge := range judges {
    report := runEvaluation(document, rubricSet, judge)
    evaluations = append(evaluations, report)
}

// Aggregate with majority voting
result := rubric.AggregateEvaluations(evaluations, rubric.AggregationMajority)

// Check agreement before trusting the consolidated decision
if result.Agreement < 0.6 {
    // Low agreement - flag for human review
    result.ConsolidatedDecision.Status = rubric.DecisionHumanReview
}
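The loop above calls the judges one after another, so total latency is the sum of all calls. If the judges are independent, they could be run concurrently instead. The sketch below shows one way to do that with goroutines; the `Judge` function type and `runPanel` are hypothetical stand-ins for whatever wraps `runEvaluation` in a real program.

```go
package main

import (
	"fmt"
	"sync"
)

// Judge is a hypothetical stand-in for a call to one evaluation backend.
type Judge func(doc string) string

// runPanel evaluates doc with every judge concurrently and returns the
// verdicts in the same order as the judges slice.
func runPanel(doc string, judges []Judge) []string {
	results := make([]string, len(judges))
	var wg sync.WaitGroup
	for i, judge := range judges {
		wg.Add(1)
		go func(i int, judge Judge) {
			defer wg.Done()
			results[i] = judge(doc) // each goroutine writes its own slot
		}(i, judge)
	}
	wg.Wait()
	return results
}

func main() {
	judges := []Judge{
		func(string) string { return "pass" },
		func(string) string { return "partial" },
		func(string) string { return "pass" },
	}
	fmt.Println(runPanel("document text", judges))
}
```

A production version would also need timeouts and error handling per judge, which this sketch leaves out.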

Best Practices¶

Judge Selection¶

Use diverse judges (different models, providers)
Consider cost/latency tradeoffs
Include at least 3 judges for majority voting

When to Use Multi-Judge¶

High-stakes evaluations (security, compliance)
Calibration of new rubrics
Contentious or subjective assessments
Building evaluation datasets

Handling Low Agreement¶

if result.Agreement < 0.5 {
    // Options:
    // 1. Flag for human review
    // 2. Use conservative aggregation
    // 3. Add more judges
    // 4. Refine rubric criteria
}

Cost Optimization¶

// Start with a single judge
report := runEvaluation(doc, rubricSet, primaryJudge)
decision := report.Decision

// Only use multi-judge for borderline cases
if decision.Status == rubric.DecisionConditional {
    reports := runMultiJudge(doc, rubricSet, allJudges)
    result := rubric.AggregateEvaluations(reports, rubric.AggregationMajority)
    decision = result.ConsolidatedDecision // panel verdict replaces the single-judge one
}

Next Steps¶

Rubrics - Define evaluation criteria
Pairwise Comparison - Compare outputs
DAG Aggregation - Multi-agent workflows