# Evaluate
Go beyond simple output comparison: AgentHealth evaluates the entire agent trajectory, including every thought, tool call, and decision.
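As a mental model, a trajectory is an ordered record of the steps an agent took for one task. The sketch below is illustrative Python only; the field names and step types are assumptions, not AgentHealth's actual trajectory schema:

```python
# Hypothetical trajectory record: every thought, tool call, and decision
# the agent produced while handling a single task. Field names are
# illustrative and do not reflect AgentHealth's real schema.
trajectory = [
    {"step": 1, "type": "thought", "content": "503s often point to an exhausted dependency."},
    {"step": 2, "type": "tool_call", "tool": "search_logs", "args": {"query": "connection pool"}},
    {"step": 3, "type": "decision", "content": "Root cause: connection pool exhaustion."},
]

# A trajectory-level evaluation can inspect intermediate steps,
# not just the final answer:
tool_calls = [s for s in trajectory if s["type"] == "tool_call"]
print(len(tool_calls))  # → 1
```

This is what distinguishes trajectory evaluation from output comparison: the judge can score *how* the agent reached its answer, not only whether the final answer matched.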
You can run evaluations three ways:
- UI — Interactive dashboard for visual exploration
- SDK — Programmatic access for custom scripts
- CLI — Command-line interface for CI/CD workflows
## UI

Run evaluations visually through the AgentHealth dashboard.
Watch the demo: AgentHealth Evaluation Walkthrough
- Open the AgentHealth UI (`npx @opensearch-project/agent-health@latest`) at http://localhost:4001
- Navigate to Benchmarks
- Select test cases and click Run
- View results with trajectory visualization
## SDK

Run evaluations programmatically for custom scripts and workflows.
### 1. Create Connector

Connect to your agent:
```python
from opensearch_agenthealth import AgentConnector

connector = AgentConnector(
    name="my-agent",
    endpoint="http://localhost:3000",
    protocol="ag-ui",
)
```

### 2. Create Test Cases and Benchmarks
Define scenarios and group them:

```python
from opensearch_agenthealth import TestCase, Benchmark

# Create a test case
test_case = TestCase(
    name="Database Timeout Diagnosis",
    initial_prompt="Why is my API returning 503 errors?",
    context=[
        {"type": "log", "content": "Connection pool exhausted..."},
    ],
    expected_outcomes=[
        "Identify connection pool exhaustion",
        "Recommend increasing pool size",
    ],
)

# Group into a benchmark
benchmark = Benchmark(
    name="RCA Scenarios",
    test_cases=[test_case],
)
```

### 3. Evaluate Agent
Run the benchmark against your agent:

```python
from opensearch_agenthealth import Evaluator

evaluator = Evaluator(judge_model="claude-sonnet-4")

result = evaluator.run(
    benchmark=benchmark,
    connector=connector,
)

print(f"Accuracy: {result.accuracy}/100")
print(f"Reasoning: {result.judge_reasoning}")
```

### 4. Compare Runs
Compare results across different agents or configurations:

```python
comparison = evaluator.compare_runs(
    benchmark=benchmark,
    connectors=[
        AgentConnector(name="GPT-4o", endpoint="..."),
        AgentConnector(name="Claude Sonnet", endpoint="..."),
    ],
)

for run in comparison.runs:
    print(f"{run.name}: {run.accuracy}/100")
```

## CLI

Run evaluations from your terminal or in CI/CD pipelines:
```shell
# Run all test cases
npx agent-health test

# Run specific test file
npx agent-health test tests/rca-scenarios.ts

# Run with specific reporter
npx agent-health test --reporter=json
```

GitHub Actions example:
```yaml
- name: Run Agent Evaluations
  run: npx agent-health test --reporter=github
```
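In a pipeline you may also want to fail the build when scores drop. One way is to post-process the JSON reporter's output; the sketch below is a minimal example assuming a hypothetical report shape with per-run `name` and `accuracy` fields, since the actual schema of `--reporter=json` is not shown here:

```python
import json
import sys

# Hypothetical JSON reporter output; the real schema emitted by
# `npx agent-health test --reporter=json` may differ.
report = json.loads('{"runs": [{"name": "RCA Scenarios", "accuracy": 87}]}')

THRESHOLD = 80  # fail the pipeline if any benchmark scores below this

failed = [r for r in report["runs"] if r["accuracy"] < THRESHOLD]
if failed:
    for r in failed:
        print(f"FAIL {r['name']}: {r['accuracy']}/100")
    sys.exit(1)  # non-zero exit fails the CI step
print("all benchmarks passed")
```

Exiting non-zero is enough for GitHub Actions to mark the step as failed, so no extra workflow configuration is needed beyond running the script.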