Experiments Overview
Experiments in AgentHealth allow you to systematically evaluate AI agent performance through benchmarks, A/B testing, and regression tracking.
What is an Experiment?
An experiment consists of:
- Benchmark: A collection of test cases to evaluate
- Run: A point-in-time execution with specific agent/model configuration
- Results: Scores, trajectories, and metrics for each test case
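These relationships could be modeled as a few simple dataclasses. This is an illustrative sketch of the data model, not AgentHealth's actual types; all names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class TestResult:
    test_case_id: str
    score: int                                   # 0-100 accuracy score from the judge
    passed: bool
    trajectory: list[str] = field(default_factory=list)

@dataclass
class Run:
    agent: str                                   # e.g. "langgraph"
    model: str                                   # e.g. "claude-sonnet-4"
    results: list[TestResult] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        """Percentage of test cases that passed in this run."""
        return 100 * sum(r.passed for r in self.results) / len(self.results)

@dataclass
class Benchmark:
    name: str
    test_case_ids: list[str]

@dataclass
class Experiment:
    name: str
    benchmark: Benchmark
    runs: list[Run] = field(default_factory=list)
```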
```
Experiment: "Claude vs GPT-4 Comparison"
├── Benchmark: "RCA Evaluation Suite" (10 test cases)
├── Run 1: LangGraph + Claude Sonnet 4
│   └── Results: 90% pass rate, 87 avg accuracy
├── Run 2: LangGraph + GPT-4o
│   └── Results: 85% pass rate, 82 avg accuracy
└── Run 3: Strands + Claude Sonnet 4
    └── Results: 92% pass rate, 89 avg accuracy
```

Creating a Benchmark
Benchmarks group related test cases for evaluation:
- Navigate to Experiments → New Benchmark
- Enter benchmark name and description
- Select test cases to include
- Add labels for organization
- Click Create Benchmark
```python
from opensearch_agentops import Benchmark

benchmark = agentops.create_benchmark(
    name="RCA Evaluation Suite",
    description="Root cause analysis scenarios",
    test_case_ids=[
        "tc-database-timeout",
        "tc-memory-leak",
        "tc-network-latency",
    ],
    labels=[
        {"key": "category", "value": "RCA"},
        {"key": "priority", "value": "P0"},
    ],
)
```

```bash
curl -X POST http://localhost:4001/api/benchmarks \
  -H "Content-Type: application/json" \
  -d '{
    "name": "RCA Evaluation Suite",
    "testCaseIds": ["tc-001", "tc-002", "tc-003"],
    "labels": [{"key": "category", "value": "RCA"}]
  }'
```

Running an Experiment
Configuration Options
| Parameter | Description | Example |
|---|---|---|
| `agent` | Agent adapter to use | `langgraph`, `strands`, `holmesgpt` |
| `model` | LLM model to use | `claude-sonnet-4`, `gpt-4o` |
| `judge_model` | Model for LLM Judge | `claude-sonnet-4` (default) |
| `timeout` | Max execution time per test (ms) | `300000` (5 minutes) |
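Putting the table together, a run configuration might look like the following. The request shape and the validation helper are assumptions for illustration, not the AgentHealth API:

```python
# Hypothetical run configuration assembled from the parameters above
run_config = {
    "agent": "langgraph",              # langgraph | strands | holmesgpt
    "model": "claude-sonnet-4",
    "judge_model": "claude-sonnet-4",  # optional; defaults to claude-sonnet-4
    "timeout": 300_000,                # milliseconds (5 minutes)
}

KNOWN_AGENTS = {"langgraph", "strands", "holmesgpt"}

def validate_run_config(cfg: dict) -> dict:
    """Minimal sanity checks before submitting a run."""
    if cfg["agent"] not in KNOWN_AGENTS:
        raise ValueError(f"unknown agent: {cfg['agent']}")
    if cfg.get("timeout", 0) <= 0:
        raise ValueError("timeout must be a positive number of milliseconds")
    cfg.setdefault("judge_model", "claude-sonnet-4")
    return cfg
```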
Execution Flow
```
1. Start Run
   └── Status: pending → running

2. For each Test Case:
   ├── Invoke agent with prompt + context
   ├── Capture trajectory (thoughts, tools, results)
   ├── Send to LLM Judge for scoring
   └── Store results

3. Complete Run
   ├── Aggregate metrics
   └── Status: running → completed
```

Real-Time Progress
Watch experiment progress in the UI:
- Live Status: See which test case is currently running
- Streaming Trajectory: Watch agent execution unfold
- Incremental Results: Scores appear as each test completes
- Cancellation: Stop a run gracefully at any time
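The execution flow above can be sketched as a minimal orchestration loop. Everything here (the stub agent, the stub judge, and the status values) is illustrative, not the actual AgentHealth implementation:

```python
def invoke_agent(prompt: str) -> list[str]:
    # Stub: a real adapter would call the configured agent/model
    return [f"thought: analyzing '{prompt}'", "tool: query_metrics", "answer: root cause found"]

def judge(trajectory: list[str]) -> int:
    # Stub: a real LLM Judge would score the trajectory 0-100
    return 90 if any(step.startswith("answer:") for step in trajectory) else 0

def run_benchmark(test_cases: list[str]) -> dict:
    status = "running"                       # pending -> running
    results = {}
    for tc in test_cases:
        trajectory = invoke_agent(tc)        # invoke agent with prompt + context
        results[tc] = judge(trajectory)      # capture trajectory, send to judge, store score
    status = "completed"                     # running -> completed
    return {"status": status, "results": results}
```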
Comparing Runs
Side-by-Side Comparison
Compare multiple runs to understand performance differences:
| Test Case | Run 1 (Claude) | Run 2 (GPT-4) | Difference |
|---|---|---|---|
| DB Timeout | 95/100 | 88/100 | +7 |
| Memory Leak | 82/100 | 85/100 | -3 |
| Network Issue | 90/100 | 78/100 | +12 |
| Average | 89/100 | 83.7/100 | +5.3 |
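The comparison columns are simple arithmetic over per-test-case scores; a minimal sketch using the scores from the table above:

```python
run1 = {"DB Timeout": 95, "Memory Leak": 82, "Network Issue": 90}  # Claude
run2 = {"DB Timeout": 88, "Memory Leak": 85, "Network Issue": 78}  # GPT-4

# Per-test-case difference (positive means run 1 scored higher)
diff = {tc: run1[tc] - run2[tc] for tc in run1}

avg1 = sum(run1.values()) / len(run1)  # 89.0
avg2 = sum(run2.values()) / len(run2)  # ≈ 83.7
```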
Trajectory Diff
Visualize how different agents approached the same problem:
```
Agent A (Claude)          Agent B (GPT-4)
─────────────────         ─────────────────
1. Analyze error logs     1. Analyze error logs
2. Query metrics          2. Check documentation
3. Identify root cause    3. Query metrics (extra step)
4. Recommend fix          4. Query more metrics
                          5. Identify root cause
                          6. Recommend fix
```

Comparison Insights
The system generates automatic insights:
- “Claude completed tasks with 30% fewer tool calls”
- “GPT-4 showed better performance on Kubernetes scenarios”
- “Hard difficulty cases show largest performance gap”
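An insight like "fewer tool calls" is just a ratio over the captured trajectories. A sketch with made-up trajectories (the step format is an assumption):

```python
def tool_calls(trajectory: list[str]) -> int:
    """Count tool-invocation steps in a trajectory."""
    return sum(1 for step in trajectory if step.startswith("tool:"))

traj_a = ["think", "tool: query_metrics", "tool: read_logs", "answer"]
traj_b = ["think", "tool: query_metrics", "tool: read_logs",
          "tool: read_logs", "tool: query_metrics", "answer"]

# Fraction fewer tool calls made by agent A relative to agent B
reduction = 1 - tool_calls(traj_a) / tool_calls(traj_b)
```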
Regression Detection
Track performance over time to catch regressions:
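At its core, regression detection compares a run's metrics against a stored baseline and fires an alert when a drop exceeds a threshold. A minimal sketch (the metric field names are assumptions):

```python
def detect_regressions(baseline: dict, current: dict,
                       pass_rate_drop: float = 5.0,
                       accuracy_drop: float = 10.0) -> list[str]:
    """Return human-readable alerts for metrics that regressed past a threshold."""
    alerts = []
    pr_delta = baseline["pass_rate"] - current["pass_rate"]
    if pr_delta > pass_rate_drop:
        alerts.append(f"pass rate dropped {pr_delta:.1f} points")
    acc_delta = baseline["avg_accuracy"] - current["avg_accuracy"]
    if acc_delta > accuracy_drop:
        alerts.append(f"accuracy dropped {acc_delta:.1f} points")
    return alerts
```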
Setting Baselines
```python
# Mark a run as the baseline for comparison
agentops.set_baseline(
    benchmark_id="bench-001",
    run_id="run-baseline-v1",
)
```

Regression Alerts
Configure alerts when metrics drop:
```yaml
regression_config:
  benchmark: "production-suite"
  thresholds:
    pass_rate_drop: 5%      # Alert if pass rate drops by 5%
    accuracy_drop: 10       # Alert if accuracy drops by 10 points
    latency_increase: 20%   # Alert if latency increases by 20%
```

Open Source Benchmarks
AgentHealth integrates with established evaluation frameworks:
Anthropic Bloom
Automated behavioral evaluation for AI safety:
```python
# Import Bloom benchmark
bloom_benchmark = agentops.import_benchmark(
    source="anthropic/bloom",
    behaviors=["sycophancy", "self-preservation"],
)

# Run evaluation
run = agentops.run_benchmark(
    benchmark_id=bloom_benchmark.id,
    agent="my-agent",
    model="claude-sonnet-4",
)
```

Bloom Behaviors Tested:
- Delusional sycophancy: Excessive agreement with user
- Self-preservation: Resisting modification/deactivation
- Self-preferential bias: Favoring itself in decisions
- Instructed sabotage: Following harmful directives
HolmesGPT Benchmarks
150+ evaluation scenarios for RCA agents:
```python
# Import HolmesGPT scenarios
holmes_benchmark = agentops.import_benchmark(
    source="holmesgpt/evaluations",
    difficulty=["easy", "medium"],
    categories=["kubernetes", "prometheus"],
)
```

Test Categories:
- Regression tests (critical, must pass)
- Difficulty levels (easy, medium, hard)
- Specialized (logs, Kubernetes, Prometheus)
Metrics and Analytics
Run Metrics
| Metric | Description |
|---|---|
| `pass_rate` | Percentage of test cases passed |
| `avg_accuracy` | Average accuracy score (0-100) |
| `avg_faithfulness` | Average faithfulness score |
| `avg_trajectory_alignment` | Average trajectory alignment score |
| `total_tokens` | Total tokens consumed |
| `total_cost` | Estimated API cost (USD) |
| `avg_latency` | Average execution time |
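Most of these run-level metrics are simple aggregates over per-test-case results. A sketch, assuming hypothetical result field names:

```python
def aggregate(results: list[dict]) -> dict:
    """Compute run-level metrics from per-test-case results."""
    n = len(results)
    return {
        "pass_rate": 100 * sum(r["passed"] for r in results) / n,
        "avg_accuracy": sum(r["accuracy"] for r in results) / n,
        "total_tokens": sum(r["tokens"] for r in results),
        "avg_latency": sum(r["latency_ms"] for r in results) / n,
    }
```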
Aggregate Analytics
Track trends across multiple runs:
```python
# Get benchmark analytics
analytics = agentops.get_analytics(
    benchmark_id="bench-001",
    time_range="last_30_days",
)

print(f"Pass rate trend: {analytics.pass_rate_trend}")
print(f"Best performing model: {analytics.best_model}")
print(f"Most failed test case: {analytics.most_failed}")
```

Next Steps
Run via SDK
Section titled “Run via SDK”Programmatic experiment execution →
Run via UI
Visual experiment management →