
Experiments Overview

Experiments in AgentHealth allow you to systematically evaluate AI agent performance through benchmarks, A/B testing, and regression tracking.

An experiment consists of:

  • Benchmark: A collection of test cases to evaluate
  • Run: A point-in-time execution with specific agent/model configuration
  • Results: Scores, trajectories, and metrics for each test case

Experiment: "Claude vs GPT-4 Comparison"
├── Benchmark: "RCA Evaluation Suite" (10 test cases)
├── Run 1: LangGraph + Claude Sonnet 4
│   └── Results: 90% pass rate, 87 avg accuracy
├── Run 2: LangGraph + GPT-4o
│   └── Results: 85% pass rate, 82 avg accuracy
└── Run 3: Strands + Claude Sonnet 4
    └── Results: 92% pass rate, 89 avg accuracy
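The hierarchy above can be sketched as a small data model. This is an illustrative sketch only; the class and field names are assumptions, not the actual AgentHealth schema.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    agent: str           # e.g. "langgraph"
    model: str           # e.g. "claude-sonnet-4"
    pass_rate: float     # fraction of test cases passed
    avg_accuracy: float  # mean accuracy score (0-100)

@dataclass
class Benchmark:
    name: str
    test_case_ids: list[str] = field(default_factory=list)

@dataclass
class Experiment:
    name: str
    benchmark: Benchmark
    runs: list[Run] = field(default_factory=list)

exp = Experiment(
    name="Claude vs GPT-4 Comparison",
    benchmark=Benchmark("RCA Evaluation Suite", [f"tc-{i:03}" for i in range(10)]),
    runs=[
        Run("langgraph", "claude-sonnet-4", 0.90, 87),
        Run("langgraph", "gpt-4o", 0.85, 82),
        Run("strands", "claude-sonnet-4", 0.92, 89),
    ],
)

# Pick the best run by pass rate
best = max(exp.runs, key=lambda r: r.pass_rate)
```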

Benchmarks group related test cases for evaluation:

  1. Navigate to Experiments → New Benchmark
  2. Enter benchmark name and description
  3. Select test cases to include
  4. Add labels for organization
  5. Click Create Benchmark

import opensearch_agentops as agentops

benchmark = agentops.create_benchmark(
    name="RCA Evaluation Suite",
    description="Root cause analysis scenarios",
    test_case_ids=[
        "tc-database-timeout",
        "tc-memory-leak",
        "tc-network-latency",
    ],
    labels=[
        {"key": "category", "value": "RCA"},
        {"key": "priority", "value": "P0"},
    ],
)
curl -X POST http://localhost:4001/api/benchmarks \
  -H "Content-Type: application/json" \
  -d '{
    "name": "RCA Evaluation Suite",
    "testCaseIds": ["tc-001", "tc-002", "tc-003"],
    "labels": [{"key": "category", "value": "RCA"}]
  }'

Parameter    Description                       Example
agent        Agent adapter to use              langgraph, strands, holmesgpt
model        LLM model to use                  claude-sonnet-4, gpt-4o
judge_model  Model for LLM Judge               claude-sonnet-4 (default)
timeout      Max execution time per test (ms)  300000 (5 minutes)
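The parameters above can be assembled into a run configuration. A minimal sketch, assuming these parameter names and defaults; `build_run_config` is a hypothetical helper, not part of the SDK.

```python
# Documented defaults from the parameter table above
DEFAULTS = {
    "judge_model": "claude-sonnet-4",  # default LLM Judge model
    "timeout": 300_000,                # 5 minutes, in milliseconds
}

# Agent adapters listed in the table
VALID_AGENTS = {"langgraph", "strands", "holmesgpt"}

def build_run_config(agent: str, model: str, **overrides) -> dict:
    """Merge explicit parameters with documented defaults."""
    if agent not in VALID_AGENTS:
        raise ValueError(f"unknown agent adapter: {agent}")
    return {"agent": agent, "model": model, **DEFAULTS, **overrides}

config = build_run_config("langgraph", "claude-sonnet-4")
```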

1. Start Run
   └── Status: pending → running
2. For each Test Case:
   ├── Invoke agent with prompt + context
   ├── Capture trajectory (thoughts, tools, results)
   ├── Send to LLM Judge for scoring
   └── Store results
3. Complete Run
   ├── Aggregate metrics
   └── Status: running → completed
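The lifecycle above reduces to a simple loop. A sketch under stated assumptions: `invoke_agent` and `judge` stand in for the real agent adapter and LLM Judge, and the pass threshold of 70 is illustrative.

```python
def execute_run(test_cases, invoke_agent, judge):
    """Run each test case through the agent, score it, and aggregate."""
    results = []
    for tc in test_cases:
        trajectory = invoke_agent(tc)      # prompt + context -> trajectory
        score = judge(tc, trajectory)      # LLM Judge scoring
        results.append({"test_case": tc, "score": score})
    passed = sum(1 for r in results if r["score"] >= 70)  # illustrative threshold
    return {
        "status": "completed",
        "results": results,
        "pass_rate": passed / len(results),
    }

# Stub agent and judge for illustration
summary = execute_run(
    ["tc-a", "tc-b"],
    invoke_agent=lambda tc: ["analyze logs", "recommend fix"],
    judge=lambda tc, trajectory: 90 if tc == "tc-a" else 60,
)
```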

Watch experiment progress in the UI:

  • Live Status: See which test case is currently running
  • Streaming Trajectory: Watch agent execution unfold
  • Incremental Results: Scores appear as each test completes
  • Cancellation: Stop a run gracefully at any time
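Outside the UI, live status can also be followed by polling. A hypothetical sketch: the `get_run` callable and the status field names are assumptions, not a documented API.

```python
import time

def wait_for_run(get_run, run_id, poll_seconds=2.0):
    """Poll run status until it reaches a terminal state."""
    while True:
        run = get_run(run_id)
        if run["status"] in ("completed", "cancelled", "failed"):
            return run
        time.sleep(poll_seconds)

# Fake status source simulating a run that finishes on the second poll
states = iter([
    {"status": "running", "current_test_case": "tc-001"},
    {"status": "completed"},
])
final = wait_for_run(lambda _run_id: next(states), "run-001", poll_seconds=0)
```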

Compare multiple runs to understand performance differences:

Test Case      Run 1 (Claude)  Run 2 (GPT-4)  Difference
DB Timeout     95/100          88/100         +7
Memory Leak    82/100          85/100         -3
Network Issue  90/100          78/100         +12
Average        89/100          83.7/100       +5.3
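The difference column above is just a per-test-case subtraction plus an average. A minimal sketch of that arithmetic, with scores hard-coded from the table:

```python
def compare_runs(run_a: dict, run_b: dict) -> dict:
    """Per-test-case score differences (run_a - run_b), plus the average gap."""
    diffs = {tc: run_a[tc] - run_b[tc] for tc in run_a}
    avg_a = sum(run_a.values()) / len(run_a)
    avg_b = sum(run_b.values()) / len(run_b)
    diffs["Average"] = round(avg_a - avg_b, 1)
    return diffs

claude = {"DB Timeout": 95, "Memory Leak": 82, "Network Issue": 90}
gpt4 = {"DB Timeout": 88, "Memory Leak": 85, "Network Issue": 78}
diffs = compare_runs(claude, gpt4)
```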

Visualize how different agents approached the same problem:

Agent A (Claude)         Agent B (GPT-4)
─────────────────        ─────────────────
1. Analyze error logs    1. Analyze error logs
2. Query metrics         2. Check documentation
3. Identify root cause   3. Query metrics (extra step)
4. Recommend fix         4. Query more metrics
                         5. Identify root cause
                         6. Recommend fix

The system generates automatic insights:

  • “Claude completed tasks with 30% fewer tool calls”
  • “GPT-4 showed better performance on Kubernetes scenarios”
  • “Hard difficulty cases show largest performance gap”

Track performance over time to catch regressions:

# Mark a run as the baseline for comparison
agentops.set_baseline(
    benchmark_id="bench-001",
    run_id="run-baseline-v1",
)

Configure alerts when metrics drop:

regression_config:
  benchmark: "production-suite"
  thresholds:
    pass_rate_drop: 5%      # Alert if pass rate drops by 5%
    accuracy_drop: 10       # Alert if accuracy drops by 10 points
    latency_increase: 20%   # Alert if latency increases by 20%
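The threshold logic above can be sketched as a baseline-vs-current check. This is an assumption about how the thresholds are evaluated, not the actual alerting implementation; the metric dict shape is illustrative.

```python
def check_regressions(baseline: dict, current: dict, thresholds: dict) -> list[str]:
    """Compare a run against the baseline and collect threshold violations."""
    alerts = []
    pass_drop = baseline["pass_rate"] - current["pass_rate"]
    if pass_drop * 100 >= thresholds["pass_rate_drop_pct"]:
        alerts.append(f"pass rate dropped {pass_drop:.0%}")
    if baseline["accuracy"] - current["accuracy"] >= thresholds["accuracy_drop"]:
        alerts.append("accuracy regression")
    latency_up = (current["latency"] - baseline["latency"]) / baseline["latency"]
    if latency_up * 100 >= thresholds["latency_increase_pct"]:
        alerts.append(f"latency up {latency_up:.0%}")
    return alerts

alerts = check_regressions(
    baseline={"pass_rate": 0.90, "accuracy": 87, "latency": 12.0},
    current={"pass_rate": 0.84, "accuracy": 85, "latency": 15.0},
    thresholds={"pass_rate_drop_pct": 5, "accuracy_drop": 10, "latency_increase_pct": 20},
)
```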

AgentHealth integrates with established evaluation frameworks:

Automated behavioral evaluation for AI safety:

# Import Bloom benchmark
bloom_benchmark = agentops.import_benchmark(
    source="anthropic/bloom",
    behaviors=["sycophancy", "self-preservation"],
)

# Run evaluation
run = agentops.run_benchmark(
    benchmark_id=bloom_benchmark.id,
    agent="my-agent",
    model="claude-sonnet-4",
)

Bloom Behaviors Tested:

  • Delusional sycophancy: Excessive agreement with user
  • Self-preservation: Resisting modification/deactivation
  • Self-preferential bias: Favoring itself in decisions
  • Instructed sabotage: Following harmful directives

150+ evaluation scenarios for RCA agents:

# Import HolmesGPT scenarios
holmes_benchmark = agentops.import_benchmark(
    source="holmesgpt/evaluations",
    difficulty=["easy", "medium"],
    categories=["kubernetes", "prometheus"],
)

Test Categories:

  • Regression tests (critical, must pass)
  • Difficulty levels (easy, medium, hard)
  • Specialized (logs, Kubernetes, Prometheus)

Metric                    Description
pass_rate                 Percentage of test cases passed
avg_accuracy              Average accuracy score (0-100)
avg_faithfulness          Average faithfulness score
avg_trajectory_alignment  Average trajectory alignment score
total_tokens              Total tokens consumed
total_cost                Estimated API cost (USD)
avg_latency               Average execution time
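The run-level metrics above are aggregates over per-test-case results. A minimal sketch of that aggregation, assuming an illustrative result shape (the actual stored schema may differ):

```python
def aggregate(results: list[dict]) -> dict:
    """Roll per-test-case results up into run-level metrics."""
    n = len(results)
    return {
        "pass_rate": sum(r["passed"] for r in results) / n,
        "avg_accuracy": sum(r["accuracy"] for r in results) / n,
        "total_tokens": sum(r["tokens"] for r in results),
        "avg_latency": sum(r["latency"] for r in results) / n,
    }

metrics = aggregate([
    {"passed": True, "accuracy": 90, "tokens": 1200, "latency": 8.0},
    {"passed": False, "accuracy": 60, "tokens": 1500, "latency": 12.0},
])
```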

Track trends across multiple runs:

# Get benchmark analytics
analytics = agentops.get_analytics(
    benchmark_id="bench-001",
    time_range="last_30_days",
)

print(f"Pass rate trend: {analytics.pass_rate_trend}")
print(f"Best performing model: {analytics.best_model}")
print(f"Most failed test case: {analytics.most_failed}")

Programmatic experiment execution →

Visual experiment management →

Automate in your pipeline →

Analyze and compare runs →