
Evaluate

Go beyond simple output comparison. AgentHealth evaluates the entire agent trajectory—every thought, tool call, and decision.

You can run evaluations three ways:

  • UI — Interactive dashboard for visual exploration
  • SDK — Programmatic access for custom scripts
  • CLI — Command-line interface for CI/CD workflows

Run evaluations visually through the AgentHealth dashboard.

Watch the demo: AgentHealth Evaluation Walkthrough

  1. Start the AgentHealth UI with npx @opensearch-project/agent-health@latest, then open http://localhost:4001
  2. Navigate to Benchmarks
  3. Select test cases and click Run
  4. View results with trajectory visualization

Run evaluations programmatically for custom scripts and workflows.

Connect to your agent:

from opensearch_agenthealth import AgentConnector

connector = AgentConnector(
    name="my-agent",
    endpoint="http://localhost:3000",
    protocol="ag-ui",
)

Define scenarios and group them:

from opensearch_agenthealth import TestCase, Benchmark

# Create test cases
test_case = TestCase(
    name="Database Timeout Diagnosis",
    initial_prompt="Why is my API returning 503 errors?",
    context=[
        {"type": "log", "content": "Connection pool exhausted..."},
    ],
    expected_outcomes=[
        "Identify connection pool exhaustion",
        "Recommend increasing pool size",
    ],
)

# Group into a benchmark
benchmark = Benchmark(
    name="RCA Scenarios",
    test_cases=[test_case],
)

Run the benchmark against your agent:

from opensearch_agenthealth import Evaluator

evaluator = Evaluator(judge_model="claude-sonnet-4")
result = evaluator.run(
    benchmark=benchmark,
    connector=connector,
)
print(f"Accuracy: {result.accuracy}/100")
print(f"Reasoning: {result.judge_reasoning}")
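In scripts it is common to gate on the score rather than just print it, so a regression fails the run. A small helper you could feed `result.accuracy` into; the 80-point threshold is purely illustrative:

```python
def assert_minimum_accuracy(accuracy: float, threshold: float = 80.0) -> None:
    """Exit non-zero if the benchmark score falls below the chosen threshold."""
    if accuracy < threshold:
        raise SystemExit(f"Accuracy {accuracy:.1f} is below threshold {threshold:.1f}")
```

Raising `SystemExit` makes the script's exit code non-zero, which is what CI systems key off.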

Compare results across different agents or configurations:

comparison = evaluator.compare_runs(
    benchmark=benchmark,
    connectors=[
        AgentConnector(name="GPT-4o", endpoint="..."),
        AgentConnector(name="Claude Sonnet", endpoint="..."),
    ],
)
for run in comparison.runs:
    print(f"{run.name}: {run.accuracy}/100")
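To make multi-agent comparisons easier to scan, the per-run (name, accuracy) pairs can be rendered best-first as a small leaderboard. This helper is our own sketch, not part of the SDK, and assumes at least one run:

```python
def format_leaderboard(scores: list[tuple[str, float]]) -> str:
    """Render (name, accuracy) pairs as an aligned, highest-score-first table."""
    ranked = sorted(scores, key=lambda s: s[1], reverse=True)
    width = max(len(name) for name, _ in ranked)
    return "\n".join(f"{name:<{width}}  {acc:>5.1f}/100" for name, acc in ranked)
```

With the SDK you would build the input as `[(run.name, run.accuracy) for run in comparison.runs]` and print the result.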

Run evaluations from your terminal or CI/CD pipelines:

# Run all test cases
npx agent-health test

# Run a specific test file
npx agent-health test tests/rca-scenarios.ts

# Run with a specific reporter
npx agent-health test --reporter=json

GitHub Actions example:

- name: Run Agent Evaluations
  run: npx agent-health test --reporter=github

Full Evaluation Guide →