
Experiments Overview

Experiments in AgentHealth allow you to systematically evaluate AI agent performance through benchmarks, A/B testing, and regression tracking.

An experiment consists of:

  • Benchmark: A collection of test cases to evaluate
  • Run: A point-in-time execution with specific agent/model configuration
  • Results: Scores, trajectories, and metrics for each test case

Experiment: "Claude vs GPT-4 Comparison"
├── Benchmark: "RCA Evaluation Suite" (10 test cases)
├── Run 1: LangGraph + Claude Sonnet 4
│   └── Results: 90% pass rate, 87 avg accuracy
├── Run 2: LangGraph + GPT-4o
│   └── Results: 85% pass rate, 82 avg accuracy
└── Run 3: Strands + Claude Sonnet 4
    └── Results: 92% pass rate, 89 avg accuracy
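The hierarchy above can be sketched as a small data model. This is an illustrative sketch only; the class and field names are assumptions, not the actual AgentHealth schema.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    agent: str           # e.g. "langgraph"
    model: str           # e.g. "claude-sonnet-4"
    pass_rate: float     # fraction of test cases passed
    avg_accuracy: float  # mean accuracy score (0-100)

@dataclass
class Benchmark:
    name: str
    test_case_ids: list[str] = field(default_factory=list)

@dataclass
class Experiment:
    name: str
    benchmark: Benchmark
    runs: list[Run] = field(default_factory=list)

exp = Experiment(
    name="Claude vs GPT-4 Comparison",
    benchmark=Benchmark("RCA Evaluation Suite", [f"tc-{i:03}" for i in range(10)]),
    runs=[
        Run("langgraph", "claude-sonnet-4", 0.90, 87),
        Run("langgraph", "gpt-4o", 0.85, 82),
        Run("strands", "claude-sonnet-4", 0.92, 89),
    ],
)

# Pick the best run by pass rate
best = max(exp.runs, key=lambda r: r.pass_rate)
```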

Benchmarks group related test cases for evaluation:

  1. Navigate to Experiments → New Benchmark
  2. Enter benchmark name and description
  3. Select test cases to include
  4. Add labels for organization
  5. Click Create Benchmark

import opensearch_agentops as agentops

benchmark = agentops.create_benchmark(
    name="RCA Evaluation Suite",
    description="Root cause analysis scenarios",
    test_case_ids=[
        "tc-database-timeout",
        "tc-memory-leak",
        "tc-network-latency",
    ],
    labels=[
        {"key": "category", "value": "RCA"},
        {"key": "priority", "value": "P0"},
    ],
)
curl -X POST http://localhost:4001/api/benchmarks \
  -H "Content-Type: application/json" \
  -d '{
    "name": "RCA Evaluation Suite",
    "testCaseIds": ["tc-001", "tc-002", "tc-003"],
    "labels": [{"key": "category", "value": "RCA"}]
  }'

Parameter    Description                       Example
agent        Agent adapter to use              langgraph, strands, holmesgpt
model        LLM model to use                  claude-sonnet-4, gpt-4o
judge_model  Model for LLM Judge               claude-sonnet-4 (default)
timeout      Max execution time per test (ms)  300000 (5 minutes)
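The parameters above can be assembled into a run configuration. A minimal sketch, assuming these parameter names and defaults; `build_run_config` is a hypothetical helper, not part of the SDK.

```python
# Documented defaults from the parameter table above
DEFAULTS = {
    "judge_model": "claude-sonnet-4",  # default LLM Judge model
    "timeout": 300_000,                # 5 minutes, in milliseconds
}

# Agent adapters listed in the table
VALID_AGENTS = {"langgraph", "strands", "holmesgpt"}

def build_run_config(agent: str, model: str, **overrides) -> dict:
    """Merge explicit parameters with documented defaults."""
    if agent not in VALID_AGENTS:
        raise ValueError(f"unknown agent adapter: {agent}")
    return {"agent": agent, "model": model, **DEFAULTS, **overrides}

config = build_run_config("langgraph", "claude-sonnet-4")
```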

1. Start Run
   └── Status: pending → running
2. For each Test Case:
   ├── Invoke agent with prompt + context
   ├── Capture trajectory (thoughts, tools, results)
   ├── Send to LLM Judge for scoring
   └── Store results
3. Complete Run
   ├── Aggregate metrics
   └── Status: running → completed
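The lifecycle above reduces to a simple loop. A sketch under stated assumptions: `invoke_agent` and `judge` stand in for the real agent adapter and LLM Judge, and the pass threshold of 70 is illustrative.

```python
def execute_run(test_cases, invoke_agent, judge):
    """Run each test case through the agent, score it, and aggregate."""
    results = []
    for tc in test_cases:
        trajectory = invoke_agent(tc)      # prompt + context -> trajectory
        score = judge(tc, trajectory)      # LLM Judge scoring
        results.append({"test_case": tc, "score": score})
    passed = sum(1 for r in results if r["score"] >= 70)  # illustrative threshold
    return {
        "status": "completed",
        "results": results,
        "pass_rate": passed / len(results),
    }

# Stub agent and judge for illustration
summary = execute_run(
    ["tc-a", "tc-b"],
    invoke_agent=lambda tc: ["analyze logs", "recommend fix"],
    judge=lambda tc, trajectory: 90 if tc == "tc-a" else 60,
)
```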

Watch experiment progress in the UI:

  • Live Status: See which test case is currently running
  • Streaming Trajectory: Watch agent execution unfold
  • Incremental Results: Scores appear as each test completes
  • Cancellation: Stop a run gracefully at any time
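Outside the UI, live status can also be followed by polling. A hypothetical sketch: the `get_run` callable and the status field names are assumptions, not a documented API.

```python
import time

def wait_for_run(get_run, run_id, poll_seconds=2.0):
    """Poll run status until it reaches a terminal state."""
    while True:
        run = get_run(run_id)
        if run["status"] in ("completed", "cancelled", "failed"):
            return run
        time.sleep(poll_seconds)

# Fake status source simulating a run that finishes on the second poll
states = iter([
    {"status": "running", "current_test_case": "tc-001"},
    {"status": "completed"},
])
final = wait_for_run(lambda _run_id: next(states), "run-001", poll_seconds=0)
```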

Compare multiple runs to understand performance differences:

Test Case      Run 1 (Claude)  Run 2 (GPT-4)  Difference
DB Timeout     95/100          88/100         +7
Memory Leak    82/100          85/100         -3
Network Issue  90/100          78/100         +12
Average        89/100          83.7/100       +5.3
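The difference column above is just a per-test-case subtraction plus an average. A minimal sketch of that arithmetic, with scores hard-coded from the table:

```python
def compare_runs(run_a: dict, run_b: dict) -> dict:
    """Per-test-case score differences (run_a - run_b), plus the average gap."""
    diffs = {tc: run_a[tc] - run_b[tc] for tc in run_a}
    avg_a = sum(run_a.values()) / len(run_a)
    avg_b = sum(run_b.values()) / len(run_b)
    diffs["Average"] = round(avg_a - avg_b, 1)
    return diffs

claude = {"DB Timeout": 95, "Memory Leak": 82, "Network Issue": 90}
gpt4 = {"DB Timeout": 88, "Memory Leak": 85, "Network Issue": 78}
diffs = compare_runs(claude, gpt4)
```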

Visualize how different agents approached the same problem:

Agent A (Claude)         Agent B (GPT-4)
─────────────────        ─────────────────
1. Analyze error logs    1. Analyze error logs
2. Query metrics         2. Check documentation
3. Identify root cause   3. Query metrics (extra step)
4. Recommend fix         4. Query more metrics
                         5. Identify root cause
                         6. Recommend fix

The system generates automatic insights:

  • “Claude completed tasks with 30% fewer tool calls”
  • “GPT-4 showed better performance on Kubernetes scenarios”
  • “Hard difficulty cases show largest performance gap”

Track performance over time to catch regressions:

# Mark a run as the baseline for comparison
agentops.set_baseline(
    benchmark_id="bench-001",
    run_id="run-baseline-v1",
)

Configure alerts when metrics drop:

regression_config:
  benchmark: "production-suite"
  thresholds:
    pass_rate_drop: 5%      # Alert if pass rate drops by 5%
    accuracy_drop: 10       # Alert if accuracy drops by 10 points
    latency_increase: 20%   # Alert if latency increases by 20%
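The threshold logic above can be sketched as a baseline-vs-current check. This is an assumption about how the thresholds are evaluated, not the actual alerting implementation; the metric dict shape is illustrative.

```python
def check_regressions(baseline: dict, current: dict, thresholds: dict) -> list[str]:
    """Compare a run against the baseline and collect threshold violations."""
    alerts = []
    pass_drop = baseline["pass_rate"] - current["pass_rate"]
    if pass_drop * 100 >= thresholds["pass_rate_drop_pct"]:
        alerts.append(f"pass rate dropped {pass_drop:.0%}")
    if baseline["accuracy"] - current["accuracy"] >= thresholds["accuracy_drop"]:
        alerts.append("accuracy regression")
    latency_up = (current["latency"] - baseline["latency"]) / baseline["latency"]
    if latency_up * 100 >= thresholds["latency_increase_pct"]:
        alerts.append(f"latency up {latency_up:.0%}")
    return alerts

alerts = check_regressions(
    baseline={"pass_rate": 0.90, "accuracy": 87, "latency": 12.0},
    current={"pass_rate": 0.84, "accuracy": 85, "latency": 15.0},
    thresholds={"pass_rate_drop_pct": 5, "accuracy_drop": 10, "latency_increase_pct": 20},
)
```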

AgentHealth integrates with established evaluation frameworks:

Automated behavioral evaluation for AI safety:

# Import Bloom benchmark
bloom_benchmark = agentops.import_benchmark(
    source="anthropic/bloom",
    behaviors=["sycophancy", "self-preservation"],
)

# Run evaluation
run = agentops.run_benchmark(
    benchmark_id=bloom_benchmark.id,
    agent="my-agent",
    model="claude-sonnet-4",
)

Bloom Behaviors Tested:

  • Delusional sycophancy: Excessive agreement with user
  • Self-preservation: Resisting modification/deactivation
  • Self-preferential bias: Favoring itself in decisions
  • Instructed sabotage: Following harmful directives

150+ evaluation scenarios for RCA agents:

# Import HolmesGPT scenarios
holmes_benchmark = agentops.import_benchmark(
    source="holmesgpt/evaluations",
    difficulty=["easy", "medium"],
    categories=["kubernetes", "prometheus"],
)

Test Categories:

  • Regression tests (critical, must pass)
  • Difficulty levels (easy, medium, hard)
  • Specialized (logs, Kubernetes, Prometheus)

Metric                    Description
pass_rate                 Percentage of test cases passed
avg_accuracy              Average accuracy score (0-100)
avg_faithfulness          Average faithfulness score
avg_trajectory_alignment  Average trajectory alignment score
total_tokens              Total tokens consumed
total_cost                Estimated API cost (USD)
avg_latency               Average execution time
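The run-level metrics above are aggregates over per-test-case results. A minimal sketch of that aggregation, assuming an illustrative result shape (the actual stored schema may differ):

```python
def aggregate(results: list[dict]) -> dict:
    """Roll per-test-case results up into run-level metrics."""
    n = len(results)
    return {
        "pass_rate": sum(r["passed"] for r in results) / n,
        "avg_accuracy": sum(r["accuracy"] for r in results) / n,
        "total_tokens": sum(r["tokens"] for r in results),
        "avg_latency": sum(r["latency"] for r in results) / n,
    }

metrics = aggregate([
    {"passed": True, "accuracy": 90, "tokens": 1200, "latency": 8.0},
    {"passed": False, "accuracy": 60, "tokens": 1500, "latency": 12.0},
])
```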

Track trends across multiple runs:

# Get benchmark analytics
analytics = agentops.get_analytics(
    benchmark_id="bench-001",
    time_range="last_30_days",
)

print(f"Pass rate trend: {analytics.pass_rate_trend}")
print(f"Best performing model: {analytics.best_model}")
print(f"Most failed test case: {analytics.most_failed}")

Programmatic experiment execution →

Visual experiment management →

Automate in your pipeline →

Analyze and compare runs →