Evaluation Overview
Evaluation is at the heart of building reliable AI agents. OpenSearch AgentHealth provides a complete evaluation framework that goes beyond simple output comparison to analyze the entire agent trajectory.
Why Trajectory-First Evaluation?
Traditional LLM evaluation focuses on whether the model produced the "correct" output. But for AI agents, the path matters as much as the destination:
Output-Only Evaluation
- Did the agent answer correctly? ✓/✗
- Limited insight into why failures occur
- Can't detect inefficient but correct behavior
- No visibility into reasoning quality
Trajectory Evaluation
- Full analysis of every step
- Identifies reasoning breakdowns
- Detects inefficient tool usage
- Provides actionable improvements
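To make the contrast concrete, here is a minimal sketch of a check that only trajectory evaluation can perform: flagging tool calls repeated with identical arguments. The step shape (`type`, `tool`, `args`) is illustrative, not AgentHealth's actual schema.

```python
# Sketch: detect redundant tool calls in a recorded trajectory.
# The step dictionaries below are illustrative, not the real schema.
from collections import Counter

def redundant_tool_calls(trajectory):
    """Return tool names called more than once with identical arguments."""
    calls = Counter(
        (step["tool"], str(step.get("args")))
        for step in trajectory
        if step["type"] == "tool_call"
    )
    return [tool for (tool, _), count in calls.items() if count > 1]

trajectory = [
    {"type": "tool_call", "tool": "fetch_logs", "args": {"service": "api"}},
    {"type": "thought", "content": "Logs show timeouts."},
    {"type": "tool_call", "tool": "fetch_logs", "args": {"service": "api"}},
]
print(redundant_tool_calls(trajectory))  # ['fetch_logs']
```

An output-only evaluator would score this run as correct; a trajectory evaluator surfaces the wasted call.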
Evaluation Components
The evaluation framework spans three areas:

| Datasets | Experiments | Analysis |
|---|---|---|
| Test Cases | Benchmarks | LLM Judge Scoring |
| Context Data | Runs | Trajectory Comparison |
| Expected Outcomes | A/B Testing | Regression Detection |
| | CI/CD | Improvement Strategies |

Golden Path Methodology
The Golden Path approach captures expected agent behavior, then scores actual performance against it:
1. Define Expected Outcomes
Instead of exact output matching, describe what should happen:
```yaml
test_case:
  name: "Database Connection Timeout"
  initial_prompt: "Why is my API returning 503 errors?"
  expected_outcomes:
    - "Agent should examine the error logs"
    - "Agent should check database connection metrics"
    - "Agent should identify connection pool exhaustion"
    - "Agent should recommend increasing pool size"
    - "Agent should NOT make more than 4 tool calls"
```

2. Capture Actual Trajectory
During evaluation, every agent step is recorded:
```javascript
trajectory: [
  { type: "thought", content: "Let me check the error logs first..." },
  { type: "tool_call", tool: "fetch_logs", args: { service: "api" } },
  { type: "tool_result", result: "Connection timeout errors detected" },
  { type: "thought", content: "Database connectivity issue. Checking metrics..." },
  { type: "tool_call", tool: "query_metrics", args: { metric: "db_connections" } },
  // ...
]
```

3. LLM Judge Evaluation
An LLM judge (Claude, GPT-4, or a local model) compares the trajectory against the expected outcomes:
| Metric | Score | Reasoning |
|---|---|---|
| Accuracy | 92/100 | Correctly identified root cause |
| Faithfulness | 88/100 | Minor speculation in step 3 |
| Trajectory Alignment | 95/100 | Followed expected diagnostic flow |
| Efficiency | 85/100 | One redundant API call |
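Per-metric scores like those above can be rolled into a single overall score. A sketch follows; equal weighting is an assumption for illustration, not an AgentHealth default.

```python
# Sketch: aggregate LLM-judge metric scores into one overall score.
# Equal weights are an assumption, not an AgentHealth default.
scores = {
    "accuracy": 92,
    "faithfulness": 88,
    "trajectory_alignment": 95,
    "efficiency": 85,
}

overall = sum(scores.values()) / len(scores)
print(overall)  # 90.0
```

Weighted schemes (e.g. prioritizing accuracy over efficiency) are a natural extension when some metrics matter more for your use case.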
Evaluation Workflow
UI Workflow
1. Create Test Cases
   - UI: Use Cases → New Use Case
2. Build Benchmark
   - UI: Experiments → New Benchmark
   - Select test cases to include
3. Run Experiment
   - UI: Select agent + model → Run
   - Watch progress in real-time
4. Analyze Results
   - UI: View scores, compare runs
   - Drill into individual trajectories

SDK Workflow
```python
from opensearch_agentops import AgentHealth, TestCase, Benchmark

agentops = AgentHealth()

# Create test case
test_case = TestCase(
    name="API Error Diagnosis",
    initial_prompt="Why is my API returning 503 errors?",
    expected_outcomes=[
        "Agent should examine error logs",
        "Agent should identify root cause",
    ],
)
agentops.create_test_case(test_case)

# Create benchmark
benchmark = Benchmark(
    name="RCA Evaluation Suite",
    test_case_ids=[test_case.id],
)
agentops.create_benchmark(benchmark)

# Run evaluation
run = agentops.run_benchmark(
    benchmark_id=benchmark.id,
    agent="langgraph",
    model="claude-sonnet-4",
)

# Analyze results
print(f"Pass Rate: {run.metrics.pass_rate}%")
print(f"Avg Accuracy: {run.metrics.avg_accuracy}")
```

CI/CD Integration
```yaml
name: Agent Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Evaluation
        run: |
          npx @opensearch-project/agent-health eval \
            --benchmark "regression-suite" \
            --agent langgraph \
            --model claude-sonnet-4

      - name: Check Results
        run: |
          npx @opensearch-project/agent-health check \
            --min-pass-rate 90 \
            --min-accuracy 85
```

Supported Evaluation Types
Functional Evaluation
Does the agent produce correct results?
- Accuracy: Correctness of final answer
- Faithfulness: Grounding in provided context
- Completeness: All required information covered
Behavioral Evaluation
Does the agent behave as expected?
- Trajectory Alignment: Steps match expected flow
- Tool Usage: Correct tools with correct parameters
- Reasoning Quality: Logical thought progression
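In AgentHealth the LLM judge performs this kind of comparison; a deliberately naive keyword-matching sketch conveys the underlying idea of trajectory alignment (the helper and matching rule here are hypothetical):

```python
# Naive sketch of trajectory alignment: check whether each expected
# outcome's key terms appear in the recorded trajectory text. The real
# framework uses an LLM judge, not keyword matching.
def outcome_covered(outcome, trajectory_text):
    # Drop the "Agent should " prefix, then require every remaining
    # keyword of 4+ characters to appear in the trajectory.
    remainder = outcome.lower().removeprefix("agent should ")
    keywords = [w for w in remainder.split() if len(w) >= 4]
    return all(w in trajectory_text for w in keywords)

trajectory_text = " ".join([
    "Let me check the error logs first...",
    "Connection timeout errors detected",
]).lower()

expected = ["Agent should check the error logs"]
covered = [o for o in expected if outcome_covered(o, trajectory_text)]
print(len(covered), "/", len(expected), "outcomes covered")
```

An LLM judge replaces the brittle keyword test with semantic comparison, which is why it can also score reasoning quality, not just term overlap.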
Efficiency Evaluation
Does the agent operate efficiently?
- Token Usage: Input/output token counts
- Tool Call Count: Number of external calls
- Latency: End-to-end execution time
- Cost: Estimated API costs
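Cost estimation typically multiplies token counts by per-token prices. A sketch follows; the prices are placeholders, not real rates, so substitute your provider's current pricing.

```python
# Sketch: estimate run cost from token usage.
# These prices are placeholders -- check your provider's actual rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # USD per million tokens

def estimate_cost(input_tokens, output_tokens):
    return (
        input_tokens / 1_000_000 * PRICE_PER_MTOK["input"]
        + output_tokens / 1_000_000 * PRICE_PER_MTOK["output"]
    )

print(round(estimate_cost(10_000, 2_000), 4))  # 0.06
```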
Open Source Benchmarks
AgentHealth supports integration with established benchmarks:
Anthropic Bloom
Automated behavioral evaluation for frontier AI models:
- Sycophancy detection: Does the agent agree excessively?
- Self-preservation: Does the agent resist modification?
- Instruction following: Multi-step task completion
- Sabotage detection: Harmful behavior patterns
Learn more about Bloom integration →
HolmesGPT Evaluations
150+ evaluation scenarios for root cause analysis agents:
- Regression tests: Critical scenarios that must pass
- Difficulty levels: Easy, Medium, Hard scenarios
- Specialized tests: Kubernetes, Prometheus, logs
Configure HolmesGPT benchmarks →
Next Steps
Datasets
Create and manage test cases →
Experiments
Run benchmarks and analyze results →