Agent Experience (AX) Integration Case Study
Executive Summary
This case study documents the integration of Agent Experience (AX) principles into opik-go, an LLM observability SDK. The integration enables AI agents to reliably trace their own behavior with machine-readable error handling, explicit retry policies, and pre-flight validation.
Key Results:
- 19 domain-specific error codes defined
- 201 operations mapped with retry policies
- 67 operations have required field definitions
- 7 capability types including streaming and evaluation
Background
The Observability Challenge
AI agents need to observe their own behavior to learn and improve. This creates a unique challenge: the observability system itself must be agent-friendly.
| Aspect | Human Developer | AI Agent |
|---|---|---|
| Error handling | Read logs, debug | Programmatic recovery |
| Missing traces | Manually investigate | Auto-create resources |
| Evaluation | Review dashboards | Run programmatically |
| Retries | Intuitive judgment | Explicit policies needed |
The Project
opik-go is a Go SDK for Comet ML's Opik observability platform:
- 201 API endpoints covering traces, spans, datasets, experiments, and evaluations
- 14,820-line OpenAPI specification
- LLM evaluation framework with heuristic and model-based scorers
- Streaming support for large result sets
- Used by AI agents to observe and improve their own behavior
The Problem
When tracing fails, agents face ambiguous errors:
```json
{
  "status": 404,
  "message": "Not found"
}
```
Questions the agent cannot answer:
- What wasn't found? — Trace? Span? Dataset? Project?
- Should it retry? — Will it help or create duplicates?
- How to recover? — Create the resource? Use a different one?
- Is it permanent? — Or a transient failure?
Before AX Integration
```go
trace, err := client.GetTrace(ctx, traceID)
if err != nil {
	if apiErr, ok := err.(*APIError); ok {
		if apiErr.StatusCode == 404 {
			// Which resource is missing?
			// The trace? The project? The workspace?
			// No way to determine programmatically
		}
	}
	return nil, err
}
```
Solution: AX Integration
Error Code Design
19 domain-specific error codes organized by category:
| Category | Error Codes | HTTP Status |
|---|---|---|
| not_found | TRACE_NOT_FOUND, SPAN_NOT_FOUND, DATASET_NOT_FOUND, EXPERIMENT_NOT_FOUND, PROMPT_NOT_FOUND, PROJECT_NOT_FOUND, FEEDBACK_NOT_FOUND, ATTACHMENT_NOT_FOUND, WORKSPACE_NOT_FOUND, EVALUATOR_NOT_FOUND, ALERT_NOT_FOUND, QUEUE_NOT_FOUND, DASHBOARD_NOT_FOUND | 404 |
| auth | UNAUTHORIZED, FORBIDDEN | 401, 403 |
| validation | INVALID_INPUT | 400 |
| conflict | CONFLICT | 409 |
| rate_limit | RATE_LIMITED | 429 |
| server | INTERNAL_ERROR | 500 |
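One way the table above could be expressed in Go is sketched below. It is an illustration only: the names `ErrorCode`, `ErrorInfo`, and `lookup` are assumptions rather than the actual `ax` package API, and only four of the 19 codes are included.

```go
package main

import (
	"fmt"
	"net/http"
)

// ErrorCode is a machine-readable AX error code (hypothetical type name).
type ErrorCode string

const (
	ErrTraceNotFound   ErrorCode = "TRACE_NOT_FOUND"
	ErrProjectNotFound ErrorCode = "PROJECT_NOT_FOUND"
	ErrUnauthorized    ErrorCode = "UNAUTHORIZED"
	ErrRateLimited     ErrorCode = "RATE_LIMITED"
)

// ErrorInfo carries the category and HTTP status columns of the table.
type ErrorInfo struct {
	Category   string
	HTTPStatus int
}

// errorInfo maps a subset of the codes to their category and status.
var errorInfo = map[ErrorCode]ErrorInfo{
	ErrTraceNotFound:   {Category: "not_found", HTTPStatus: http.StatusNotFound},
	ErrProjectNotFound: {Category: "not_found", HTTPStatus: http.StatusNotFound},
	ErrUnauthorized:    {Category: "auth", HTTPStatus: http.StatusUnauthorized},
	ErrRateLimited:     {Category: "rate_limit", HTTPStatus: http.StatusTooManyRequests},
}

// lookup returns the metadata for a code, or false for an unknown code.
func lookup(code ErrorCode) (ErrorInfo, bool) {
	info, ok := errorInfo[code]
	return info, ok
}

func main() {
	info, _ := lookup(ErrProjectNotFound)
	fmt.Println(info.Category, info.HTTPStatus) // not_found 404
}
```

Two codes share the 404 status here, which is why the agent branches on the code and not on the status.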
Retry Policy Mapping
All 201 operations mapped with retry safety:
```go
var RetryPolicy = map[string]bool{
	// Safe to retry (GET operations)
	"getTraceById":      true,
	"findTraces":        true,
	"getSpanById":       true,
	"findDatasets":      true,
	"streamExperiments": true,
	// Not safe (mutations)
	"createTrace":      false,
	"createSpan":       false,
	"createExperiment": false,
	"deleteTraceById":  false,
	"evaluateTraces":   false,
}
```
Distribution:
| Category | Count | Retryable |
|---|---|---|
| GET (read) | 78 | Yes |
| POST (create) | 62 | No |
| PUT/PATCH (update) | 31 | No |
| DELETE | 30 | No |
Required Fields Extraction
67 operations have required field definitions:
```go
var RequiredFields = map[string][]string{
	"createTrace":      {"name"},
	"createSpan":       {"trace_id", "name"},
	"createExperiment": {"dataset_name", "name"},
	"createDataset":    {"name"},
	"createPrompt":     {"name", "template"},
	"evaluateTraces":   {"trace_ids", "evaluator_ids"},
	"evaluateSpans":    {"span_ids", "evaluator_ids"},
	// ... 60 more
}
```
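A check against this map could look roughly like the following. The signature is an assumption: the real `ax.ValidateFields` may take a different argument type for the present fields. Here it takes a set of field names and returns an empty string when nothing is missing.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredFields mirrors two entries of the RequiredFields map above.
var requiredFields = map[string][]string{
	"createSpan":       {"trace_id", "name"},
	"createExperiment": {"dataset_name", "name"},
}

// validateFields returns "" when every required field is present,
// otherwise a message naming the missing fields in declaration order.
func validateFields(operationID string, present map[string]bool) string {
	var missing []string
	for _, field := range requiredFields[operationID] {
		if !present[field] {
			missing = append(missing, field)
		}
	}
	if len(missing) == 0 {
		return ""
	}
	return fmt.Sprintf("%s: missing required fields: %s",
		operationID, strings.Join(missing, ", "))
}

func main() {
	fmt.Println(validateFields("createSpan", map[string]bool{"name": true}))
	// createSpan: missing required fields: trace_id
}
```

An operation with no entry in the map passes validation, matching the 134 operations that have no required field definitions.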
Capability Mapping
7 capability types for observability operations:
```go
const (
	CapRead      Capability = "read"      // Data retrieval
	CapWrite     Capability = "write"     // Data creation/modification
	CapDelete    Capability = "delete"    // Data removal
	CapAdmin     Capability = "admin"     // Administrative operations
	CapStream    Capability = "stream"    // Streaming responses
	CapEvaluate  Capability = "evaluate"  // LLM evaluation
	CapAnalytics Capability = "analytics" // Metrics and BI
)
```
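A small registry is enough to support lookups over these constants. The sketch below is an assumption about how such a registry might be structured. The function names are lowercase stand-ins, and the capability assignments per operation are illustrative, not taken from the SDK.

```go
package main

import (
	"fmt"
	"sort"
)

// Capability names a class of operation.
type Capability string

const (
	CapRead     Capability = "read"
	CapStream   Capability = "stream"
	CapEvaluate Capability = "evaluate"
)

// operationCaps maps operation IDs to capabilities (illustrative entries).
var operationCaps = map[string][]Capability{
	"getTraceById":       {CapRead},
	"streamExperiments":  {CapRead, CapStream},
	"streamDatasetItems": {CapRead, CapStream},
	"evaluateTraces":     {CapEvaluate},
}

// hasCapability reports whether the operation declares the capability.
func hasCapability(operationID string, c Capability) bool {
	for _, have := range operationCaps[operationID] {
		if have == c {
			return true
		}
	}
	return false
}

// operationsByCapability returns the matching operation IDs, sorted so
// that the result does not depend on map iteration order.
func operationsByCapability(c Capability) []string {
	var ops []string
	for op := range operationCaps {
		if hasCapability(op, c) {
			ops = append(ops, op)
		}
	}
	sort.Strings(ops)
	return ops
}

func main() {
	fmt.Println(operationsByCapability(CapStream))
	// [streamDatasetItems streamExperiments]
}
```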
Results
Error Handling Improvement
Before:
```go
trace, err := client.GetTrace(ctx, traceID)
if err != nil {
	// Generic error handling
	log.Printf("Error: %v", err)
	return nil, err
}
```
After:
```go
trace, err := client.GetTrace(ctx, traceID)
if err != nil {
	code, ok := opik.GetAXErrorCode(err)
	if !ok {
		return nil, err
	}
	switch code {
	case ax.ErrTraceNotFound:
		// Create the trace first
		return client.CreateTrace(ctx, &Trace{ID: traceID, Name: "auto-created"})
	case ax.ErrProjectNotFound:
		// Create the project, then retry the read.
		// Assumes CreateProject returns only an error.
		if err := client.CreateProject(ctx, projectName); err != nil {
			return nil, err
		}
		return client.GetTrace(ctx, traceID)
	case ax.ErrUnauthorized:
		// Re-authenticate
		return nil, ErrNeedsAuth
	case ax.ErrRateLimited:
		// Back off, then retry once; stop early if ctx is cancelled
		select {
		case <-time.After(time.Second):
			return client.GetTrace(ctx, traceID)
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, err
}
```
Self-Healing Tracing Pattern
```go
func (a *Agent) recordAction(ctx context.Context, action Action) error {
	trace := &Trace{
		Name:  action.Name,
		Input: action.Input,
	}
	err := a.client.CreateTrace(ctx, trace)
	if err == nil {
		return nil
	}
	// Self-healing based on AX metadata
	info := opik.GetAXErrorInfo(err)
	if info == nil {
		return err
	}
	switch info.Category {
	case "not_found":
		// Create the missing project, then retry the write.
		// Assumes CreateProject returns only an error.
		if code, _ := opik.GetAXErrorCode(err); code == ax.ErrProjectNotFound {
			if perr := a.client.CreateProject(ctx, a.projectName); perr != nil {
				return perr
			}
			return a.client.CreateTrace(ctx, trace)
		}
	case "rate_limit":
		// Fixed 2s backoff, then retry. A production agent should cap the
		// number of attempts and grow the delay between them.
		if info.Retryable {
			time.Sleep(2 * time.Second)
			return a.recordAction(ctx, action)
		}
	case "conflict":
		// Resource exists, fetch it and update it instead
		existing, gerr := a.client.GetTrace(ctx, trace.ID)
		if gerr != nil {
			return gerr
		}
		return a.updateTrace(ctx, existing, action)
	}
	return err
}
```
Pre-flight Validation
```go
func validateRequest(operationID string, req interface{}) error {
	// Extract present fields via reflection or manual mapping
	present := extractPresentFields(req)
	if msg := ax.ValidateFields(operationID, present); msg != "" {
		return fmt.Errorf("validation failed: %s", msg)
	}
	return nil
}

// Usage
func (c *Client) CreateExperiment(ctx context.Context, req *ExperimentRequest) error {
	if err := validateRequest("createExperiment", req); err != nil {
		return err // Fail fast without API call
	}
	return c.api.CreateExperiment(ctx, req)
}
```
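The `extractPresentFields` helper is not shown in the source. A reflection-based version might look like the sketch below, which assumes request structs carry `json` tags and treats any non-zero field as present. The `ExperimentRequest` fields here are stand-ins, not the SDK's real type.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// ExperimentRequest is a stand-in request type for this sketch.
type ExperimentRequest struct {
	Name        string            `json:"name"`
	DatasetName string            `json:"dataset_name"`
	Metadata    map[string]string `json:"metadata,omitempty"`
}

// extractPresentFields returns the JSON names of all exported struct
// fields that hold a non-zero value. Nil or non-struct input yields an
// empty set.
func extractPresentFields(req interface{}) map[string]bool {
	present := map[string]bool{}
	v := reflect.ValueOf(req)
	for v.Kind() == reflect.Ptr {
		if v.IsNil() {
			return present
		}
		v = v.Elem()
	}
	if v.Kind() != reflect.Struct {
		return present
	}
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		field := t.Field(i)
		if !field.IsExported() {
			continue
		}
		// Take the name part of the tag, ignoring options like omitempty.
		name := strings.Split(field.Tag.Get("json"), ",")[0]
		if name == "" || name == "-" {
			continue
		}
		if !v.Field(i).IsZero() {
			present[name] = true
		}
	}
	return present
}

func main() {
	req := &ExperimentRequest{Name: "baseline"}
	fmt.Println(extractPresentFields(req)) // map[name:true]
}
```

Because zero values count as absent, a required field that is legitimately `0` or `false` would be reported missing. Pointer fields or a manual mapping avoid that ambiguity.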
Capability-Based Discovery
```go
// Find operations that support streaming
streamOps := ax.GetOperationsByCapability(ax.CapStream)
// ["streamDatasetItems", "streamExperimentItems", "streamExperiments"]

// Check if evaluation is available
if ax.HasCapability("evaluateTraces", ax.CapEvaluate) {
	// Run automatic quality evaluation
	scores, err := client.EvaluateTraces(ctx, traceIDs, evaluatorIDs)
	if err != nil {
		log.Printf("evaluation failed: %v", err)
	} else {
		log.Printf("evaluation scores: %v", scores)
	}
}

// Find analytics operations for dashboards
analyticsOps := ax.GetOperationsByCapability(ax.CapAnalytics)
// ["getProjectMetrics", "getProjectStats", "costsSummary", ...]
```
Metrics
Code Changes
| Component | Files | Lines |
|---|---|---|
| ax package | 6 new files | ~950 lines |
| errors.go | 1 modified | ~80 lines |
| Total | 7 files | ~1,030 lines |
Coverage
| Metadata Type | Count | Coverage |
|---|---|---|
| Error codes | 19 | Domain-complete |
| Retry policies | 201 | 100% of operations |
| Required fields | 67 | 33% (mutation operations) |
| Capabilities | ~100 | Key operations |
Test Results
```console
$ go test -v ./ax/...
=== RUN   TestIsErrorCode
--- PASS: TestIsErrorCode (0.00s)
=== RUN   TestContainsErrorCode
--- PASS: TestContainsErrorCode (0.00s)
=== RUN   TestGetErrorInfo
--- PASS: TestGetErrorInfo (0.00s)
=== RUN   TestErrorCategoryHelpers
--- PASS: TestErrorCategoryHelpers (0.00s)
=== RUN   TestIsRetryable
--- PASS: TestIsRetryable (0.00s)
=== RUN   TestRetryableCount
--- PASS: TestRetryableCount (0.00s)
=== RUN   TestGetRequiredFields
--- PASS: TestGetRequiredFields (0.00s)
=== RUN   TestCapabilities
--- PASS: TestCapabilities (0.00s)
...
PASS
ok github.com/plexusone/opik-go/ax
```
Key Learnings
1. Observability Needs Precision
Generic "not found" errors are insufficient when agents trace themselves. Each resource type needs its own error code:
- TRACE_NOT_FOUND: The trace doesn't exist
- PROJECT_NOT_FOUND: The project doesn't exist (create it first)
- SPAN_NOT_FOUND: The span doesn't exist (but the trace might)
2. Self-Healing Patterns are Essential
Agents that observe themselves must handle their own tracing failures:
```go
// Bad: Agent stops observing itself on first error
// Good: Agent recovers and continues observing
if code == ax.ErrProjectNotFound {
	createProject()
	retryTrace()
}
```
3. Domain-Specific Capabilities Matter
Observability has unique capabilities that generic CRUD doesn't capture:
- CapEvaluate — LLM quality evaluation
- CapStream — Large result streaming
- CapAnalytics — Metrics and BI operations
4. Retry Policies Prevent Data Corruption
Observability data is append-only, so retrying a write that may already have succeeded adds a second record instead of overwriting the first:
- createTrace (not retryable): would create duplicate traces
- evaluateTraces (not retryable): would run the evaluation twice
- getTraceById (retryable): reads are safe to retry
5. HTTP Status is Not Enough
Status 404 could mean:
- Trace not found
- Span not found
- Project not found
- Workspace not found
- Any of 9 other "not found" conditions
AX error codes disambiguate completely.
Comparison with elevenlabs-go
| Aspect | elevenlabs-go | opik-go |
|---|---|---|
| Domain | Voice generation | LLM observability |
| Endpoints | 204 | 201 |
| Error codes | 9 (API discovered) | 19 (domain defined) |
| Retry policies | 236 | 201 |
| Required fields | 72 | 67 |
| Special capabilities | - | Stream, Evaluate, Analytics |
| Self-healing | Media errors | Tracing errors |
Both SDKs benefit from AX, but the specific error codes and capabilities differ by domain.
Future Work
API Discovery
Run ax-spec discovery against the real Opik API to find additional error codes not documented in the spec.
Idempotency Support
Add x-ax-idempotent extension support for safe retry of create operations with idempotency keys.
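Since this extension is not yet implemented, the following is only a sketch of how the retry decision might change once it exists. The `opMeta` struct, the `canRetry` helper, and the choice to mark `createTrace` as idempotent are all assumptions made for illustration.

```go
package main

import "fmt"

// opMeta is hypothetical per-operation metadata. Idempotent models the
// proposed x-ax-idempotent extension.
type opMeta struct {
	Retryable  bool // safe to retry unconditionally
	Idempotent bool // safe to retry when an idempotency key is sent
}

// ops holds illustrative entries only.
var ops = map[string]opMeta{
	"getTraceById":   {Retryable: true},
	"createTrace":    {Idempotent: true},
	"evaluateTraces": {},
}

// canRetry reports whether a failed call may be repeated: either the
// operation is retryable as-is, or it honors idempotency keys and the
// caller supplied one.
func canRetry(operationID, idempotencyKey string) bool {
	meta, ok := ops[operationID]
	if !ok {
		return false
	}
	return meta.Retryable || (meta.Idempotent && idempotencyKey != "")
}

func main() {
	fmt.Println(canRetry("createTrace", ""))        // false
	fmt.Println(canRetry("createTrace", "key-123")) // true
}
```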
Batch Operation Handling
Define patterns for partial success in batch operations (some items succeed, some fail).
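One possible shape for such a result is sketched below. The `BatchResult` and `ItemError` types are hypothetical and do not exist in opik-go. The idea is to report per-item outcomes so an agent can resubmit only the items that failed with a retryable code.

```go
package main

import "fmt"

// ItemError records why one item in a batch failed (hypothetical type).
type ItemError struct {
	Index int    // position of the item in the request
	Code  string // AX error code, e.g. "RATE_LIMITED"
}

// BatchResult separates the items that succeeded from those that failed.
type BatchResult struct {
	Succeeded []int
	Failed    []ItemError
}

// Partial reports whether the batch had both successes and failures.
func (r BatchResult) Partial() bool {
	return len(r.Succeeded) > 0 && len(r.Failed) > 0
}

// RetryIndexes returns the indexes of failed items whose error code is
// in the retryable set, so only those items are resubmitted.
func (r BatchResult) RetryIndexes(retryable map[string]bool) []int {
	var idx []int
	for _, f := range r.Failed {
		if retryable[f.Code] {
			idx = append(idx, f.Index)
		}
	}
	return idx
}

func main() {
	res := BatchResult{
		Succeeded: []int{0, 2},
		Failed: []ItemError{
			{Index: 1, Code: "RATE_LIMITED"},
			{Index: 3, Code: "INVALID_INPUT"},
		},
	}
	fmt.Println(res.Partial(), res.RetryIndexes(map[string]bool{"RATE_LIMITED": true}))
	// true [1]
}
```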
Evaluation Metadata
Expose evaluator capabilities (what metrics they produce, what inputs they need).
Conclusion
The AX integration transforms opik-go from a basic observability SDK to an agent-friendly one:
| Aspect | Before | After |
|---|---|---|
| Error handling | HTTP status parsing | 19 typed error codes |
| Retry decisions | Hardcoded or missing | 201 operations mapped |
| Validation | Runtime API errors | Pre-flight validation |
| Capabilities | Unknown | Domain-specific discovery |
| Agent behavior | Fragile tracing | Self-healing observability |
For AI agents that need to observe themselves, reliable observability is foundational. AX makes that reliability achievable.