
Evaluation Overview

Evaluation is at the heart of building reliable AI agents. OpenSearch AgentHealth provides a complete evaluation framework that goes beyond simple output comparison to analyze the entire agent trajectory.

Traditional LLM evaluation focuses on whether the model produced the "correct" output. But for AI agents, the path matters as much as the destination:

Traditional output-only evaluation:

  • Did the agent answer correctly? ✓/✗
  • Limited insight into why failures occur
  • Can't detect inefficient but correct behavior
  • No visibility into reasoning quality

Trajectory evaluation with AgentHealth:

  • Full analysis of every step
  • Identifies reasoning breakdowns
  • Detects inefficient tool usage
  • Provides actionable improvements
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Evaluation Framework β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Datasets β”‚ Experiments β”‚ Analysis β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β€’ Test Cases β”‚ β€’ Benchmarks β”‚ β€’ LLM Judge Scoring β”‚
β”‚ β€’ Context Data β”‚ β€’ Runs β”‚ β€’ Trajectory Comparison β”‚
β”‚ β€’ Expected β”‚ β€’ A/B Testing β”‚ β€’ Regression Detection β”‚
β”‚ Outcomes β”‚ β€’ CI/CD β”‚ β€’ Improvement Strategies β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The Golden Path approach captures expected agent behavior, then scores actual performance against it. Instead of matching exact output, you describe what should happen:

```yaml
test_case:
  name: "Database Connection Timeout"
  initial_prompt: "Why is my API returning 503 errors?"
  expected_outcomes:
    - "Agent should examine the error logs"
    - "Agent should check database connection metrics"
    - "Agent should identify connection pool exhaustion"
    - "Agent should recommend increasing pool size"
    - "Agent should NOT make more than 4 tool calls"
```
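As a rough sketch (not the AgentHealth SDK itself), a golden-path test case like the one above can be modeled as a small dataclass that validates its own fields before being registered; the class and field names here are illustrative:

```python
from dataclasses import dataclass, field


@dataclass
class GoldenPathCase:
    # Mirrors the YAML fields above; names are illustrative, not the SDK's.
    name: str
    initial_prompt: str
    expected_outcomes: list[str] = field(default_factory=list)

    def validate(self) -> None:
        # A useful test case always has a prompt and at least one outcome.
        if not self.initial_prompt.strip():
            raise ValueError("initial_prompt must not be empty")
        if not self.expected_outcomes:
            raise ValueError("at least one expected outcome is required")


case = GoldenPathCase(
    name="Database Connection Timeout",
    initial_prompt="Why is my API returning 503 errors?",
    expected_outcomes=[
        "Agent should examine the error logs",
        "Agent should check database connection metrics",
    ],
)
case.validate()  # raises ValueError if the case is malformed
```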

During evaluation, every agent step is recorded:

```
trajectory: [
  { type: "thought", content: "Let me check the error logs first..." },
  { type: "tool_call", tool: "fetch_logs", args: { service: "api" } },
  { type: "tool_result", result: "Connection timeout errors detected" },
  { type: "thought", content: "Database connectivity issue. Checking metrics..." },
  { type: "tool_call", tool: "query_metrics", args: { metric: "db_connections" } },
  // ...
]
```
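Because every step is recorded, hard constraints from the golden path (such as "no more than 4 tool calls") reduce to simple counts over the trajectory. A minimal sketch, using dicts shaped like the steps above (field names are illustrative):

```python
# Each recorded step is a dict like {"type": "tool_call", ...}; this mirrors
# the trajectory shown above.
trajectory = [
    {"type": "thought", "content": "Let me check the error logs first..."},
    {"type": "tool_call", "tool": "fetch_logs", "args": {"service": "api"}},
    {"type": "tool_result", "result": "Connection timeout errors detected"},
    {"type": "thought", "content": "Database connectivity issue. Checking metrics..."},
    {"type": "tool_call", "tool": "query_metrics", "args": {"metric": "db_connections"}},
]


def count_tool_calls(steps):
    """Number of external tool invocations in a recorded trajectory."""
    return sum(1 for s in steps if s["type"] == "tool_call")


# The golden path above caps tool calls at 4, so this trajectory passes.
assert count_tool_calls(trajectory) <= 4
```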

An LLM (Claude, GPT-4, or local models) compares the trajectory against expected outcomes:

| Metric               | Score  | Reasoning                         |
|----------------------|--------|-----------------------------------|
| Accuracy             | 92/100 | Correctly identified root cause   |
| Faithfulness         | 88/100 | Minor speculation in step 3       |
| Trajectory Alignment | 95/100 | Followed expected diagnostic flow |
| Efficiency           | 85/100 | One redundant API call            |
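Under the hood, judge-based scoring amounts to serializing the trajectory and expected outcomes into a prompt the judge model can grade. A hedged sketch of what such prompt assembly might look like; the actual AgentHealth prompt template is not shown here, and `build_judge_prompt` is a hypothetical helper:

```python
def build_judge_prompt(expected_outcomes, trajectory):
    """Assemble a scoring prompt for an LLM judge (illustrative format;
    the real template may differ)."""
    steps = "\n".join(
        f"{i}. [{s['type']}] {s.get('content') or s.get('tool') or s.get('result')}"
        for i, s in enumerate(trajectory, 1)
    )
    outcomes = "\n".join(f"- {o}" for o in expected_outcomes)
    return (
        "Score the agent trajectory below against the expected outcomes.\n"
        f"Expected outcomes:\n{outcomes}\n\n"
        f"Trajectory:\n{steps}\n"
        "Return a 0-100 score per metric with a one-line reason."
    )


prompt = build_judge_prompt(
    ["Agent should examine the error logs"],
    [{"type": "thought", "content": "Let me check the error logs first..."}],
)
```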
1. Create Test Cases
   └── UI: Use Cases → New Use Case
2. Build Benchmark
   ├── UI: Experiments → New Benchmark
   └── Select test cases to include
3. Run Experiment
   ├── UI: Select agent + model → Run
   └── Watch progress in real-time
4. Analyze Results
   ├── UI: View scores, compare runs
   └── Drill into individual trajectories
```python
from opensearch_agentops import AgentHealth, TestCase, Benchmark

agentops = AgentHealth()  # client instance used by the calls below

# Create test case
test_case = TestCase(
    name="API Error Diagnosis",
    initial_prompt="Why is my API returning 503 errors?",
    expected_outcomes=[
        "Agent should examine error logs",
        "Agent should identify root cause",
    ],
)
agentops.create_test_case(test_case)

# Create benchmark
benchmark = Benchmark(
    name="RCA Evaluation Suite",
    test_case_ids=[test_case.id],
)
agentops.create_benchmark(benchmark)

# Run evaluation
run = agentops.run_benchmark(
    benchmark_id=benchmark.id,
    agent="langgraph",
    model="claude-sonnet-4",
)

# Analyze results
print(f"Pass Rate: {run.metrics.pass_rate}%")
print(f"Avg Accuracy: {run.metrics.avg_accuracy}")
```
```yaml
# .github/workflows/eval.yml
name: Agent Evaluation
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evaluation
        run: |
          npx @opensearch-project/agent-health eval \
            --benchmark "regression-suite" \
            --agent langgraph \
            --model claude-sonnet-4
      - name: Check Results
        run: |
          npx @opensearch-project/agent-health check \
            --min-pass-rate 90 \
            --min-accuracy 85
```

Does the agent produce correct results?

  • Accuracy: Correctness of final answer
  • Faithfulness: Grounding in provided context
  • Completeness: All required information covered
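Outcome checks can start cheap before involving an LLM judge. For example, a completeness pre-check can verify that required facts appear verbatim in the final answer; this is a heuristic sketch, not the AgentHealth scoring method, and `completeness_precheck` is a hypothetical helper:

```python
def completeness_precheck(final_answer, required_facts):
    """Fraction of required facts mentioned verbatim in the answer.
    A cheap pre-filter; LLM-judge scoring handles paraphrases."""
    answer = final_answer.lower()
    hits = [f for f in required_facts if f.lower() in answer]
    return len(hits) / len(required_facts)


score = completeness_precheck(
    "Root cause: connection pool exhaustion; increase the pool size.",
    ["connection pool", "pool size"],
)
# score == 1.0 only when every required fact is present
```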

Does the agent behave as expected?

  • Trajectory Alignment: Steps match expected flow
  • Tool Usage: Correct tools with correct parameters
  • Reasoning Quality: Logical thought progression
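Trajectory alignment can be approximated as an ordered-subsequence check: did the expected tool calls occur in order, possibly with other steps in between? A minimal sketch (function and field names are illustrative):

```python
def aligns_with(expected_tools, trajectory):
    """True if the expected tool names appear in order among the
    trajectory's tool calls (extra steps allowed in between)."""
    calls = iter(s["tool"] for s in trajectory if s["type"] == "tool_call")
    # `tool in calls` consumes the iterator up to the match, so order matters.
    return all(tool in calls for tool in expected_tools)


steps = [
    {"type": "tool_call", "tool": "fetch_logs"},
    {"type": "thought", "content": "Database connectivity issue..."},
    {"type": "tool_call", "tool": "query_metrics"},
]
assert aligns_with(["fetch_logs", "query_metrics"], steps)      # expected order
assert not aligns_with(["query_metrics", "fetch_logs"], steps)  # wrong order
```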

Is the agent efficient?

  • Token Usage: Input/output token counts
  • Tool Call Count: Number of external calls
  • Latency: End-to-end execution time
  • Cost: Estimated API costs
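The efficiency metrics above can be derived directly from trajectory records. A sketch assuming hypothetical per-step token fields and a flat per-token price (real pricing varies by model and provider):

```python
def efficiency_summary(steps, usd_per_1k_tokens=0.003):
    """Aggregate token, tool-call, and cost figures from step records.
    Field names (input_tokens/output_tokens) are illustrative assumptions."""
    tokens_in = sum(s.get("input_tokens", 0) for s in steps)
    tokens_out = sum(s.get("output_tokens", 0) for s in steps)
    tool_calls = sum(1 for s in steps if s.get("type") == "tool_call")
    cost = (tokens_in + tokens_out) / 1000 * usd_per_1k_tokens
    return {
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "tool_calls": tool_calls,
        "est_cost_usd": round(cost, 4),
    }


records = [
    {"type": "thought", "input_tokens": 500, "output_tokens": 80},
    {"type": "tool_call", "input_tokens": 600, "output_tokens": 40},
]
summary = efficiency_summary(records)
```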

AgentHealth supports integration with established benchmarks:

Bloom provides automated behavioral evaluation for frontier AI models:

  • Sycophancy detection: Does the agent agree excessively?
  • Self-preservation: Does the agent resist modification?
  • Instruction following: Multi-step task completion
  • Sabotage detection: Harmful behavior patterns

Learn more about Bloom integration β†’

HolmesGPT offers 150+ evaluation scenarios for root cause analysis agents:

  • Regression tests: Critical scenarios that must pass
  • Difficulty levels: Easy, Medium, Hard scenarios
  • Specialized tests: Kubernetes, Prometheus, logs

Configure HolmesGPT benchmarks β†’

  • Create and manage test cases →

  • Run benchmarks and analyze results →

  • Automate evaluation in your pipeline →