
Evaluation Overview

Evaluation is at the heart of building reliable AI agents. OpenSearch AgentHealth provides a complete evaluation framework that goes beyond simple output comparison to analyze the entire agent trajectory.

Traditional LLM evaluation focuses on whether the model produced the "correct" output. But for AI agents, the path matters as much as the destination:

Traditional output-only evaluation:

  • Did the agent answer correctly? ✓/✗
  • Limited insight into why failures occur
  • Can't detect inefficient but correct behavior
  • No visibility into reasoning quality

Trajectory evaluation with AgentHealth:

  • Full analysis of every step
  • Identifies reasoning breakdowns
  • Detects inefficient tool usage
  • Provides actionable improvements
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Evaluation Framework β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Datasets β”‚ Experiments β”‚ Analysis β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β€’ Test Cases β”‚ β€’ Benchmarks β”‚ β€’ LLM Judge Scoring β”‚
β”‚ β€’ Context Data β”‚ β€’ Runs β”‚ β€’ Trajectory Comparison β”‚
β”‚ β€’ Expected β”‚ β€’ A/B Testing β”‚ β€’ Regression Detection β”‚
β”‚ Outcomes β”‚ β€’ CI/CD β”‚ β€’ Improvement Strategies β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The Golden Path approach captures expected agent behavior, then scores actual performance against it. Instead of matching exact output, you describe what should happen:

```yaml
test_case:
  name: "Database Connection Timeout"
  initial_prompt: "Why is my API returning 503 errors?"
  expected_outcomes:
    - "Agent should examine the error logs"
    - "Agent should check database connection metrics"
    - "Agent should identify connection pool exhaustion"
    - "Agent should recommend increasing pool size"
    - "Agent should NOT make more than 4 tool calls"
```
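As a rough sketch (not the AgentHealth SDK itself), a golden-path test case like the one above can be modeled as a small dataclass that validates its own fields before being registered; the class and field names here are illustrative:

```python
from dataclasses import dataclass, field


@dataclass
class GoldenPathCase:
    # Mirrors the YAML fields above; names are illustrative, not the SDK's.
    name: str
    initial_prompt: str
    expected_outcomes: list[str] = field(default_factory=list)

    def validate(self) -> None:
        # A useful test case always has a prompt and at least one outcome.
        if not self.initial_prompt.strip():
            raise ValueError("initial_prompt must not be empty")
        if not self.expected_outcomes:
            raise ValueError("at least one expected outcome is required")


case = GoldenPathCase(
    name="Database Connection Timeout",
    initial_prompt="Why is my API returning 503 errors?",
    expected_outcomes=[
        "Agent should examine the error logs",
        "Agent should check database connection metrics",
    ],
)
case.validate()  # raises ValueError if the case is malformed
```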

During evaluation, every agent step is recorded:

```
trajectory: [
  { type: "thought", content: "Let me check the error logs first..." },
  { type: "tool_call", tool: "fetch_logs", args: { service: "api" } },
  { type: "tool_result", result: "Connection timeout errors detected" },
  { type: "thought", content: "Database connectivity issue. Checking metrics..." },
  { type: "tool_call", tool: "query_metrics", args: { metric: "db_connections" } },
  // ...
]
```
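Because every step is recorded, hard constraints from the golden path (such as "no more than 4 tool calls") reduce to simple counts over the trajectory. A minimal sketch, using dicts shaped like the steps above (field names are illustrative):

```python
# Each recorded step is a dict like {"type": "tool_call", ...}; this mirrors
# the trajectory shown above.
trajectory = [
    {"type": "thought", "content": "Let me check the error logs first..."},
    {"type": "tool_call", "tool": "fetch_logs", "args": {"service": "api"}},
    {"type": "tool_result", "result": "Connection timeout errors detected"},
    {"type": "thought", "content": "Database connectivity issue. Checking metrics..."},
    {"type": "tool_call", "tool": "query_metrics", "args": {"metric": "db_connections"}},
]


def count_tool_calls(steps):
    """Number of external tool invocations in a recorded trajectory."""
    return sum(1 for s in steps if s["type"] == "tool_call")


# The golden path above caps tool calls at 4, so this trajectory passes.
assert count_tool_calls(trajectory) <= 4
```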

An LLM (Claude, GPT-4, or local models) compares the trajectory against expected outcomes:

| Metric               | Score  | Reasoning                         |
|----------------------|--------|-----------------------------------|
| Accuracy             | 92/100 | Correctly identified root cause   |
| Faithfulness         | 88/100 | Minor speculation in step 3       |
| Trajectory Alignment | 95/100 | Followed expected diagnostic flow |
| Efficiency           | 85/100 | One redundant API call            |
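Under the hood, judge-based scoring amounts to serializing the trajectory and expected outcomes into a prompt the judge model can grade. A hedged sketch of what such prompt assembly might look like; the actual AgentHealth prompt template is not shown here, and `build_judge_prompt` is a hypothetical helper:

```python
def build_judge_prompt(expected_outcomes, trajectory):
    """Assemble a scoring prompt for an LLM judge (illustrative format;
    the real template may differ)."""
    steps = "\n".join(
        f"{i}. [{s['type']}] {s.get('content') or s.get('tool') or s.get('result')}"
        for i, s in enumerate(trajectory, 1)
    )
    outcomes = "\n".join(f"- {o}" for o in expected_outcomes)
    return (
        "Score the agent trajectory below against the expected outcomes.\n"
        f"Expected outcomes:\n{outcomes}\n\n"
        f"Trajectory:\n{steps}\n"
        "Return a 0-100 score per metric with a one-line reason."
    )


prompt = build_judge_prompt(
    ["Agent should examine the error logs"],
    [{"type": "thought", "content": "Let me check the error logs first..."}],
)
```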
1. Create Test Cases
   └── UI: Use Cases → New Use Case
2. Build Benchmark
   ├── UI: Experiments → New Benchmark
   └── Select test cases to include
3. Run Experiment
   ├── UI: Select agent + model → Run
   └── Watch progress in real-time
4. Analyze Results
   ├── UI: View scores, compare runs
   └── Drill into individual trajectories
```python
from opensearch_agentops import AgentHealth, TestCase, Benchmark

agentops = AgentHealth()  # client instance used by the calls below

# Create test case
test_case = TestCase(
    name="API Error Diagnosis",
    initial_prompt="Why is my API returning 503 errors?",
    expected_outcomes=[
        "Agent should examine error logs",
        "Agent should identify root cause",
    ],
)
agentops.create_test_case(test_case)

# Create benchmark
benchmark = Benchmark(
    name="RCA Evaluation Suite",
    test_case_ids=[test_case.id],
)
agentops.create_benchmark(benchmark)

# Run evaluation
run = agentops.run_benchmark(
    benchmark_id=benchmark.id,
    agent="langgraph",
    model="claude-sonnet-4",
)

# Analyze results
print(f"Pass Rate: {run.metrics.pass_rate}%")
print(f"Avg Accuracy: {run.metrics.avg_accuracy}")
```
```yaml
# .github/workflows/eval.yml
name: Agent Evaluation
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evaluation
        run: |
          npx @opensearch-project/agent-health eval \
            --benchmark "regression-suite" \
            --agent langgraph \
            --model claude-sonnet-4
      - name: Check Results
        run: |
          npx @opensearch-project/agent-health check \
            --min-pass-rate 90 \
            --min-accuracy 85
```

Does the agent produce correct results?

  • Accuracy: Correctness of final answer
  • Faithfulness: Grounding in provided context
  • Completeness: All required information covered
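Outcome checks can start cheap before involving an LLM judge. For example, a completeness pre-check can verify that required facts appear verbatim in the final answer; this is a heuristic sketch, not the AgentHealth scoring method, and `completeness_precheck` is a hypothetical helper:

```python
def completeness_precheck(final_answer, required_facts):
    """Fraction of required facts mentioned verbatim in the answer.
    A cheap pre-filter; LLM-judge scoring handles paraphrases."""
    answer = final_answer.lower()
    hits = [f for f in required_facts if f.lower() in answer]
    return len(hits) / len(required_facts)


score = completeness_precheck(
    "Root cause: connection pool exhaustion; increase the pool size.",
    ["connection pool", "pool size"],
)
# score == 1.0 only when every required fact is present
```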

Does the agent behave as expected?

  • Trajectory Alignment: Steps match expected flow
  • Tool Usage: Correct tools with correct parameters
  • Reasoning Quality: Logical thought progression
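Trajectory alignment can be approximated as an ordered-subsequence check: did the expected tool calls occur in order, possibly with other steps in between? A minimal sketch (function and field names are illustrative):

```python
def aligns_with(expected_tools, trajectory):
    """True if the expected tool names appear in order among the
    trajectory's tool calls (extra steps allowed in between)."""
    calls = iter(s["tool"] for s in trajectory if s["type"] == "tool_call")
    # `tool in calls` consumes the iterator up to the match, so order matters.
    return all(tool in calls for tool in expected_tools)


steps = [
    {"type": "tool_call", "tool": "fetch_logs"},
    {"type": "thought", "content": "Database connectivity issue..."},
    {"type": "tool_call", "tool": "query_metrics"},
]
assert aligns_with(["fetch_logs", "query_metrics"], steps)      # expected order
assert not aligns_with(["query_metrics", "fetch_logs"], steps)  # wrong order
```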

Is the agent efficient?

  • Token Usage: Input/output token counts
  • Tool Call Count: Number of external calls
  • Latency: End-to-end execution time
  • Cost: Estimated API costs
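The efficiency metrics above can be derived directly from trajectory records. A sketch assuming hypothetical per-step token fields and a flat per-token price (real pricing varies by model and provider):

```python
def efficiency_summary(steps, usd_per_1k_tokens=0.003):
    """Aggregate token, tool-call, and cost figures from step records.
    Field names (input_tokens/output_tokens) are illustrative assumptions."""
    tokens_in = sum(s.get("input_tokens", 0) for s in steps)
    tokens_out = sum(s.get("output_tokens", 0) for s in steps)
    tool_calls = sum(1 for s in steps if s.get("type") == "tool_call")
    cost = (tokens_in + tokens_out) / 1000 * usd_per_1k_tokens
    return {
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "tool_calls": tool_calls,
        "est_cost_usd": round(cost, 4),
    }


records = [
    {"type": "thought", "input_tokens": 500, "output_tokens": 80},
    {"type": "tool_call", "input_tokens": 600, "output_tokens": 40},
]
summary = efficiency_summary(records)
```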

AgentHealth supports integration with established benchmarks:

Bloom provides automated behavioral evaluation for frontier AI models:

  • Sycophancy detection: Does the agent agree excessively?
  • Self-preservation: Does the agent resist modification?
  • Instruction following: Multi-step task completion
  • Sabotage detection: Harmful behavior patterns

Learn more about Bloom integration β†’

HolmesGPT offers 150+ evaluation scenarios for root cause analysis agents:

  • Regression tests: Critical scenarios that must pass
  • Difficulty levels: Easy, Medium, Hard scenarios
  • Specialized tests: Kubernetes, Prometheus, logs

Configure HolmesGPT benchmarks β†’

  • Create and manage test cases →

  • Run benchmarks and analyze results →

  • Automate evaluation in your pipeline →