Pairwise Comparison
Pairwise comparison judges two outputs against each other instead of scoring each one on an absolute scale. It is useful when relative quality matters more than an absolute score.
Overview
type PairwiseComparison struct {
	ID           string         `json:"id"`
	Input        string         `json:"input"`
	OutputA      string         `json:"output_a"`
	OutputB      string         `json:"output_b"`
	Winner       Winner         `json:"winner"`
	Reasoning    string         `json:"reasoning"`
	Confidence   float64        `json:"confidence"`
	SwapPosition bool           `json:"swap_position"` // For bias detection
	Judge        *JudgeMetadata `json:"judge,omitempty"`
}

const (
	WinnerA   Winner = "a"
	WinnerB   Winner = "b"
	WinnerTie Winner = "tie"
)
Creating a Comparison
comparison := evaluation.NewPairwiseComparison(
	"Summarize this article about climate change",
	"Output A: Climate change is causing global temperatures...",
	"Output B: The article discusses how climate change...",
)
// Set the winner after evaluation
comparison.SetWinner(
	evaluation.WinnerA,
	"Output A provides more specific details and better structure",
	0.85, // Confidence 0-1
)
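Position Bias Detection
LLM judges can exhibit position bias, systematically preferring whichever output appears first (or second). To detect it, run each comparison twice, the second time with the positions swapped:
// Original comparison
comp1 := evaluation.NewPairwiseComparison(input, outputA, outputB)
comp1.SetWinner(evaluation.WinnerA, "...", 0.9)
// Swapped comparison: the original A is now in position B
comp2 := evaluation.NewPairwiseComparison(input, outputB, outputA)
comp2.SwapPosition = true
comp2.SetWinner(evaluation.WinnerB, "...", 0.85) // Position B holds the original A
// Check for consistency: a consistent judge flips its verdict when the positions flip
consistent := (comp1.Winner == evaluation.WinnerA && comp2.Winner == evaluation.WinnerB) ||
	(comp1.Winner == evaluation.WinnerB && comp2.Winner == evaluation.WinnerA) ||
	(comp1.Winner == evaluation.WinnerTie && comp2.Winner == evaluation.WinnerTie)
if consistent {
	// Both runs prefer the same underlying output (or both are ties)
} else {
	// The verdict changed with the ordering: possible position bias
}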
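The consistency check becomes easier to reuse if every verdict is first translated back into the original labels. The following is a minimal sketch of that idea, not the library's implementation: `verdict`, `originalWinner`, and `consistent` are hypothetical names, and `verdict` is a trimmed-down stand-in that copies only the `Winner` and `SwapPosition` fields shown in the Overview.

```go
package main

import "fmt"

// Winner mirrors the Winner type shown in the Overview.
type Winner string

const (
	WinnerA   Winner = "a"
	WinnerB   Winner = "b"
	WinnerTie Winner = "tie"
)

// verdict is a hypothetical stand-in for PairwiseComparison,
// keeping only the two fields this check needs.
type verdict struct {
	Winner       Winner
	SwapPosition bool
}

// originalWinner expresses a verdict in the original (unswapped) labels.
func originalWinner(v verdict) Winner {
	if !v.SwapPosition {
		return v.Winner
	}
	switch v.Winner {
	case WinnerA:
		return WinnerB
	case WinnerB:
		return WinnerA
	default:
		return v.Winner // a tie stays a tie
	}
}

// consistent reports whether two verdicts agree once both are
// expressed in the original labels.
func consistent(x, y verdict) bool {
	return originalWinner(x) == originalWinner(y)
}

func main() {
	first := verdict{Winner: WinnerA}
	second := verdict{Winner: WinnerB, SwapPosition: true}
	fmt.Println(consistent(first, second)) // prints true
}
```

Normalizing first also covers the case where the judge prefers the original B in both runs, and the case where both runs end in a tie.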
Aggregating Comparisons
Compute overall results from multiple comparisons:
comparisons := []evaluation.PairwiseComparison{
	comp1, comp2, comp3, // Multiple comparisons
}
result := evaluation.ComputePairwiseResult(comparisons)
fmt.Printf("Win rate A: %.1f%%\n", result.WinRateA*100)
fmt.Printf("Win rate B: %.1f%%\n", result.WinRateB*100)
fmt.Printf("Tie rate: %.1f%%\n", result.TieRate*100)
fmt.Printf("Overall winner: %s\n", result.OverallWinner)
PairwiseResult Structure
type PairwiseResult struct {
	TotalComparisons int     `json:"total_comparisons"`
	WinsA            int     `json:"wins_a"`
	WinsB            int     `json:"wins_b"`
	Ties             int     `json:"ties"`
	WinRateA         float64 `json:"win_rate_a"`
	WinRateB         float64 `json:"win_rate_b"`
	TieRate          float64 `json:"tie_rate"`
	OverallWinner    Winner  `json:"overall_winner"`
	PositionBias     float64 `json:"position_bias"` // 0 = no bias
}
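To make the relationship between the count and rate fields concrete, here is a rough sketch of how such a result could be tallied. It is an assumption about what the fields mean, not the actual `ComputePairwiseResult`: `result` and `tally` are hypothetical names, the input is assumed to be already expressed in original labels, and `PositionBias` is left out because this page does not say how it is computed.

```go
package main

import "fmt"

// Winner mirrors the Winner type shown in the Overview.
type Winner string

const (
	WinnerA   Winner = "a"
	WinnerB   Winner = "b"
	WinnerTie Winner = "tie"
)

// result is a hypothetical stand-in holding a subset of the PairwiseResult fields.
type result struct {
	Total, WinsA, WinsB, Ties   int
	WinRateA, WinRateB, TieRate float64
	OverallWinner               Winner
}

// tally counts verdicts and derives the rates from the counts.
// Verdicts with an unrecognized value are skipped.
func tally(winners []Winner) result {
	r := result{OverallWinner: WinnerTie}
	for _, w := range winners {
		switch w {
		case WinnerA:
			r.WinsA++
		case WinnerB:
			r.WinsB++
		case WinnerTie:
			r.Ties++
		default:
			continue // not a valid verdict, do not count it
		}
		r.Total++
	}
	if r.Total == 0 {
		return r // avoid dividing by zero
	}
	n := float64(r.Total)
	r.WinRateA = float64(r.WinsA) / n
	r.WinRateB = float64(r.WinsB) / n
	r.TieRate = float64(r.Ties) / n
	switch {
	case r.WinsA > r.WinsB:
		r.OverallWinner = WinnerA
	case r.WinsB > r.WinsA:
		r.OverallWinner = WinnerB
	}
	return r
}

func main() {
	r := tally([]Winner{WinnerA, WinnerA, WinnerB, WinnerTie})
	fmt.Printf("A %.0f%%, B %.0f%%, tie %.0f%%, overall %s\n",
		r.WinRateA*100, r.WinRateB*100, r.TieRate*100, r.OverallWinner)
	// prints "A 50%, B 25%, tie 25%, overall a"
}
```

In this sketch the three rates always sum to 1 when at least one valid verdict is present, and equal win counts yield an overall tie.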
Use Cases
Model Comparison
Compare outputs from different models:
// Compare GPT-4 vs Claude responses
comp := evaluation.NewPairwiseComparison(
	"Explain quantum computing",
	gpt4Response,
	claudeResponse,
)
A/B Testing Prompts
Compare different prompt strategies:
// Compare concise vs detailed prompts
comp := evaluation.NewPairwiseComparison(
	userQuery,
	responseFromConcisePrompt,
	responseFromDetailedPrompt,
)
Human Preference Alignment
Collect human preferences for RLHF:
// Store human preference
comp := evaluation.NewPairwiseComparison(instruction, outputA, outputB)
comp.SetWinner(evaluation.WinnerB, "Preferred by human annotator", 1.0)
comp.Judge = &evaluation.JudgeMetadata{
	Model:    "human",
	Provider: "internal-annotation",
}
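Best Practices
Randomize Position
Randomize which output appears first so that position bias does not systematically favor one side, and record the swap so the verdict can be mapped back to the original outputs:
if rand.Float64() < 0.5 {
	comp = evaluation.NewPairwiseComparison(input, outputA, outputB)
} else {
	comp = evaluation.NewPairwiseComparison(input, outputB, outputA)
	comp.SwapPosition = true
}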
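One way to keep the randomization testable is to separate the coin flip from the ordering logic. The sketch below illustrates that split; `presentation` and `order` are hypothetical names and not part of the library.

```go
package main

import (
	"fmt"
	"math/rand"
)

// presentation records the order in which two outputs are shown to a judge.
type presentation struct {
	First, Second string
	Swapped       bool // true when the original B is shown first
}

// order builds the presentation for a given coin flip.
func order(swap bool, outputA, outputB string) presentation {
	if swap {
		return presentation{First: outputB, Second: outputA, Swapped: true}
	}
	return presentation{First: outputA, Second: outputB}
}

func main() {
	swapped := 0
	for i := 0; i < 1000; i++ {
		// rand.Intn(2) returns 0 or 1 with equal probability.
		if order(rand.Intn(2) == 1, "output A", "output B").Swapped {
			swapped++
		}
	}
	fmt.Printf("B shown first in %d of 1000 trials\n", swapped)
}
```

Because `order` takes the flip as an argument, both orderings can be exercised deterministically in tests while production code keeps drawing the flip at random.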
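Use Multiple Judges
Run the same comparison with several judges to reduce the effect of any single judge's bias:
results := []evaluation.PairwiseComparison{}
for _, judge := range judges {
	// runComparison is a placeholder for your own judging call
	comp := runComparison(input, outputA, outputB, judge)
	results = append(results, comp)
}
aggregate := evaluation.ComputePairwiseResult(results)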
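Beyond pooled win rates, it can help to see how strongly the judges agree on a single pair. The following sketch takes a plurality vote and reports the share of judges behind the leading verdict. Everything here is an illustrative assumption: `judge` and `panel` are hypothetical names, and a judge is modeled as a plain function rather than the library's `JudgeMetadata`.

```go
package main

import "fmt"

// Winner mirrors the Winner type shown in the Overview.
type Winner string

const (
	WinnerA   Winner = "a"
	WinnerB   Winner = "b"
	WinnerTie Winner = "tie"
)

// judge is a hypothetical stand-in for whatever produces a verdict
// (an LLM call, a human label, and so on).
type judge func(input, outputA, outputB string) Winner

// panel runs every judge on the same pair. It returns the plurality
// verdict and the share of judges behind the leading vote count.
// If no verdict strictly leads, the result is a tie.
func panel(judges []judge, input, outputA, outputB string) (Winner, float64) {
	if len(judges) == 0 {
		return WinnerTie, 0
	}
	counts := map[Winner]int{}
	for _, j := range judges {
		counts[j(input, outputA, outputB)]++
	}
	best, bestCount := WinnerTie, counts[WinnerTie]
	for _, w := range []Winner{WinnerA, WinnerB} {
		switch {
		case counts[w] > bestCount:
			best, bestCount = w, counts[w]
		case counts[w] == bestCount:
			best = WinnerTie // no clear plurality
		}
	}
	return best, float64(bestCount) / float64(len(judges))
}

func main() {
	prefersA := func(_, _, _ string) Winner { return WinnerA }
	prefersB := func(_, _, _ string) Winner { return WinnerB }
	winner, share := panel([]judge{prefersA, prefersA, prefersB}, "question", "x", "y")
	fmt.Printf("%s %.2f\n", winner, share) // prints "a 0.67"
}
```

A low share signals a contested pair that may deserve a closer look or an additional judge.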
Track Confidence
Record a confidence score with every verdict so that comparisons can be weighted later:
comp.SetWinner(evaluation.WinnerA, reasoning, 0.95) // High confidence
comp.SetWinner(evaluation.WinnerTie, reasoning, 0.55) // Low confidence: too close to call, recorded as a tie
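One simple way to use the recorded confidence is to sum it per side instead of counting each win as 1. This is a sketch of that weighting scheme under stated assumptions, not something `ComputePairwiseResult` is documented to do: `scored` and `weightedWins` are hypothetical names.

```go
package main

import "fmt"

// Winner mirrors the Winner type shown in the Overview.
type Winner string

const (
	WinnerA   Winner = "a"
	WinnerB   Winner = "b"
	WinnerTie Winner = "tie"
)

// scored is a hypothetical stand-in for PairwiseComparison,
// keeping only the verdict and its confidence.
type scored struct {
	Winner     Winner
	Confidence float64
}

// weightedWins sums the confidence behind each side; ties add to neither.
func weightedWins(verdicts []scored) (a, b float64) {
	for _, v := range verdicts {
		switch v.Winner {
		case WinnerA:
			a += v.Confidence
		case WinnerB:
			b += v.Confidence
		}
	}
	return a, b
}

func main() {
	a, b := weightedWins([]scored{
		{WinnerA, 0.95},
		{WinnerB, 0.60},
		{WinnerTie, 0.55},
	})
	fmt.Printf("weighted A: %.2f, weighted B: %.2f\n", a, b)
	// prints "weighted A: 0.95, weighted B: 0.60"
}
```

With this weighting, a single confident verdict can outweigh a hesitant one for the other side, so it is worth checking that the judge's confidence values are reasonably calibrated before relying on them.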
Next Steps
- Multi-Judge Aggregation - Combine multiple judges
- Rubrics - Define evaluation criteria