# Evaluate
Go beyond simple output comparison: AgentHealth evaluates the entire agent trajectory, including every thought, tool call, and decision.
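As a mental model, a trajectory is an ordered record of the steps an agent took for one task. The sketch below is illustrative Python only; the field names and step types are assumptions, not AgentHealth's actual trajectory schema:

```python
# Hypothetical trajectory record: every thought, tool call, and decision
# the agent produced while handling a single task. Field names are
# illustrative and do not reflect AgentHealth's real schema.
trajectory = [
    {"step": 1, "type": "thought", "content": "503s often point to an exhausted dependency."},
    {"step": 2, "type": "tool_call", "tool": "search_logs", "args": {"query": "connection pool"}},
    {"step": 3, "type": "decision", "content": "Root cause: connection pool exhaustion."},
]

# A trajectory-level evaluation can inspect intermediate steps,
# not just the final answer:
tool_calls = [s for s in trajectory if s["type"] == "tool_call"]
print(len(tool_calls))  # → 1
```

This is what distinguishes trajectory evaluation from output comparison: the judge can score *how* the agent reached its answer, not only whether the final answer matched.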
You can run evaluations three ways:
- UI — Interactive dashboard for visual exploration
- SDK — Programmatic access for custom scripts
- CLI — Command-line interface for CI/CD workflows
## UI

Run evaluations visually through the AgentHealth dashboard.
Watch the demo: AgentHealth Evaluation Walkthrough
- Open the AgentHealth UI (`npx @opensearch-project/agent-health@latest`) at http://localhost:4001
- Navigate to Benchmarks
- Select test cases and click Run
- View results with trajectory visualization
## SDK

Run evaluations programmatically for custom scripts and workflows.
### 1. Create Connector

Connect to your agent:
```python
from opensearch_agenthealth import AgentConnector

connector = AgentConnector(
    name="my-agent",
    endpoint="http://localhost:3000",
    protocol="ag-ui",
)
```

### 2. Create Test Cases and Benchmarks
Define scenarios and group them:

```python
from opensearch_agenthealth import TestCase, Benchmark

# Create a test case
test_case = TestCase(
    name="Database Timeout Diagnosis",
    initial_prompt="Why is my API returning 503 errors?",
    context=[
        {"type": "log", "content": "Connection pool exhausted..."},
    ],
    expected_outcomes=[
        "Identify connection pool exhaustion",
        "Recommend increasing pool size",
    ],
)

# Group into a benchmark
benchmark = Benchmark(
    name="RCA Scenarios",
    test_cases=[test_case],
)
```

### 3. Evaluate Agent
Run the benchmark against your agent:

```python
from opensearch_agenthealth import Evaluator

evaluator = Evaluator(judge_model="claude-sonnet-4")

result = evaluator.run(
    benchmark=benchmark,
    connector=connector,
)

print(f"Accuracy: {result.accuracy}/100")
print(f"Reasoning: {result.judge_reasoning}")
```

### 4. Compare Runs
Compare results across different agents or configurations:

```python
comparison = evaluator.compare_runs(
    benchmark=benchmark,
    connectors=[
        AgentConnector(name="GPT-4o", endpoint="..."),
        AgentConnector(name="Claude Sonnet", endpoint="..."),
    ],
)

for run in comparison.runs:
    print(f"{run.name}: {run.accuracy}/100")
```

## CLI

Run evaluations from your terminal or in CI/CD pipelines:
```shell
# Run all test cases
npx agent-health test

# Run specific test file
npx agent-health test tests/rca-scenarios.ts

# Run with specific reporter
npx agent-health test --reporter=json
```

GitHub Actions example:
```yaml
- name: Run Agent Evaluations
  run: npx agent-health test --reporter=github
```
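In a pipeline you may also want to fail the build when scores drop. One way is to post-process the JSON reporter's output; the sketch below is a minimal example assuming a hypothetical report shape with per-run `name` and `accuracy` fields, since the actual schema of `--reporter=json` is not shown here:

```python
import json
import sys

# Hypothetical JSON reporter output; the real schema emitted by
# `npx agent-health test --reporter=json` may differ.
report = json.loads('{"runs": [{"name": "RCA Scenarios", "accuracy": 87}]}')

THRESHOLD = 80  # fail the pipeline if any benchmark scores below this

failed = [r for r in report["runs"] if r["accuracy"] < THRESHOLD]
if failed:
    for r in failed:
        print(f"FAIL {r['name']}: {r['accuracy']}/100")
    sys.exit(1)  # non-zero exit fails the CI step
print("all benchmarks passed")
```

Exiting non-zero is enough for GitHub Actions to mark the step as failed, so no extra workflow configuration is needed beyond running the script.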