Structured Evaluation
Reusable evaluation framework for LLM-as-Judge and multi-agent workflows
Structured Evaluation provides standardized Go types for evaluation reports, enabling consistent quality assessment across LLM-based and deterministic workflows.
Features
- ⚖️ LLM-as-Judge Assessments - Categorical scoring (pass/partial/fail) with severity-based findings
- ✅ GO/NO-GO Summary Reports - Deterministic checks for CI, tests, and validation
- 🔗 Multi-Agent Coordination - DAG-based report aggregation using Kahn's algorithm
- 📊 Rubric Definitions - Explicit criteria for consistent evaluations
- 🔄 Pairwise Comparison - Compare outputs instead of absolute scoring
- 👥 Multi-Judge Aggregation - Combine evaluations from multiple judges with agreement metrics
Quick Example
```go
package main

import (
	"os"

	"github.com/plexusone/structured-evaluation/evaluation"
	"github.com/plexusone/structured-evaluation/render/terminal"
)

func main() {
	report := evaluation.NewEvaluationReport("prd", "document.md")

	// Add category results (pass/partial/fail)
	report.AddCategory(evaluation.CategoryResult{
		Category:  "problem_definition",
		Score:     evaluation.ScorePass,
		Reasoning: "Clear problem statement with measurable goals",
	})
	report.AddCategory(evaluation.CategoryResult{
		Category:  "user_stories",
		Score:     evaluation.ScorePartial,
		Reasoning: "Stories present but missing acceptance criteria",
	})

	// Add findings
	report.AddFinding(evaluation.Finding{
		Severity:       evaluation.SeverityMedium,
		Category:       "metrics",
		Title:          "Missing baseline metrics",
		Recommendation: "Add current baseline measurements",
	})

	report.Finalize("sevaluation check document.md")

	// Render to terminal
	renderer := terminal.New(os.Stdout)
	renderer.Render(&report)
}
```
Report Types
| Type | Purpose | Use Case |
|---|---|---|
| EvaluationReport | LLM-as-Judge assessments | PRD reviews, code quality, content evaluation |
| SummaryReport | GO/NO-GO deterministic checks | CI pipelines, release validation, test results |
Severity Levels
Following InfoSec conventions:
| Severity | Icon | Blocking | Description |
|---|---|---|---|
| Critical | 🔴 | Yes | Must fix before approval |
| High | 🔴 | Yes | Must fix before approval |
| Medium | 🟡 | No | Should fix, tracked |
| Low | 🟢 | No | Nice to fix |
| Info | ⚪ | No | Informational only |
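The "Blocking" column in the table above can be read as a simple rule: a report cannot be approved while any Critical or High finding is open. The sketch below illustrates that rule with locally defined types. The `Severity` type, its constants, and the helpers `isBlocking` and `countBlocking` are assumptions made for this example and are not the library's API (the library exposes its own constants, such as `evaluation.SeverityMedium`).

```go
package main

import "fmt"

// Severity is a hypothetical local type for illustration only;
// it is not the type defined by structured-evaluation.
type Severity string

const (
	Critical Severity = "critical"
	High     Severity = "high"
	Medium   Severity = "medium"
	Low      Severity = "low"
	Info     Severity = "info"
)

// isBlocking mirrors the "Blocking" column of the severity table:
// only Critical and High findings must be fixed before approval.
func isBlocking(s Severity) bool {
	return s == Critical || s == High
}

// countBlocking returns how many of the given severities block approval.
func countBlocking(found []Severity) int {
	n := 0
	for _, s := range found {
		if isBlocking(s) {
			n++
		}
	}
	return n
}

func main() {
	found := []Severity{Medium, High, Info}

	if n := countBlocking(found); n > 0 {
		fmt.Printf("blocked: %d finding(s) must be fixed before approval\n", n)
	} else {
		fmt.Println("no blocking findings")
	}
}
```

Keeping the blocking rule in a single predicate means that Medium, Low, and Info findings can still be reported and tracked without affecting the approval decision.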
Next Steps
- Installation - Get started with structured-evaluation
- Quick Start - Create your first evaluation report
- Concepts - Understand the evaluation model
- CLI - Use the command-line tool