# Datasets Overview
Datasets are collections of test cases used to evaluate AI agent performance. Unlike traditional unit tests, AgentHealth datasets capture complex scenarios with context, tools, and expected behaviors.
## What is a Dataset?

A dataset in AgentHealth consists of:
- Test Cases: Individual evaluation scenarios
- Context Items: Supporting data (logs, metrics, docs)
- Expected Outcomes: Behavioral expectations for the agent
- Labels: Organization and filtering metadata
```
Dataset: "RCA Evaluation Suite"
├── Test Case: "Database Connection Timeout"
│   ├── Context: Error logs, metrics data
│   ├── Tools: log_fetcher, metrics_query
│   └── Expected: Identify connection pool issue
├── Test Case: "Memory Leak Detection"
│   ├── Context: Memory graphs, heap dumps
│   ├── Tools: metrics_query, heap_analyzer
│   └── Expected: Identify leak source
└── Test Case: "Network Latency Spike"
    ├── Context: Network traces, latency data
    ├── Tools: trace_analyzer, ping_test
    └── Expected: Identify routing issue
```
## Test Case Structure

Each test case defines a complete evaluation scenario:
```typescript
interface TestCase {
  // Identification
  id: string;
  name: string;
  description?: string;

  // Evaluation Input
  initialPrompt: string;        // The question/task for the agent
  context: AgentContextItem[];  // Supporting information
  tools: ToolDefinition[];      // Available tools

  // Expected Behavior
  expectedOutcomes: string[];   // What should happen

  // Organization
  labels: Label[];              // Tags for filtering
  version: number;              // Immutable versioning
}
```
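As a hedged sketch, a minimal object conforming to this interface might look like the following. The field values are illustrative, not from a real dataset, and the context and tool lists are left empty here:

```typescript
// Hypothetical minimal test case; values are illustrative only.
const timeoutCase = {
  id: "tc-001",
  name: "Database Connection Timeout",
  description: "Agent must trace 503 errors back to an exhausted pool",
  initialPrompt: "Why is my API returning 503 errors?",
  context: [],           // would hold log/metric/document items
  tools: [],             // would hold tool definitions like log_fetcher
  expectedOutcomes: [
    "Agent should identify pool exhaustion as root cause",
  ],
  labels: [{ key: "category", value: "RCA" }],
  version: 1,
};
```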
## Context Items

Context provides information the agent needs to complete the task:

### Logs
```typescript
{
  type: "log",
  source: "api-server",
  content: `
    2024-01-15 10:23:45 ERROR Connection timeout to database
    2024-01-15 10:23:46 ERROR Failed to acquire connection from pool
    2024-01-15 10:23:47 WARN  Connection pool exhausted (50/50)
  `,
  metadata: {
    timeRange: "last 5 minutes",
    level: "ERROR"
  }
}
```

### Metrics
```typescript
{
  type: "metric",
  source: "prometheus",
  content: {
    name: "db_connection_pool_size",
    values: [
      { timestamp: "10:20:00", value: 45 },
      { timestamp: "10:21:00", value: 48 },
      { timestamp: "10:22:00", value: 50 },
      { timestamp: "10:23:00", value: 50 }
    ]
  }
}
```

### Documentation
```typescript
{
  type: "document",
  source: "runbook",
  content: `
    ## Database Connection Pool Troubleshooting

    If connection pool is exhausted:
    1. Check for connection leaks in application code
    2. Increase pool size in config (max 100)
    3. Review query execution times
  `
}
```
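The three examples above suggest a shape for the `AgentContextItem` type referenced in the test case interface. The following discriminated union is a sketch inferred from those examples; any field beyond `type`, `source`, `content`, and `metadata` is an assumption:

```typescript
// Sketch of AgentContextItem inferred from the log/metric/document
// examples; not necessarily the exact AgentHealth type definition.
type AgentContextItem =
  | { type: "log"; source: string; content: string; metadata?: Record<string, string> }
  | { type: "metric"; source: string; content: { name: string; values: { timestamp: string; value: number }[] } }
  | { type: "document"; source: string; content: string };

const logItem: AgentContextItem = {
  type: "log",
  source: "api-server",
  content: "2024-01-15 10:23:47 WARN Connection pool exhausted (50/50)",
};
```

Modeling context as a discriminated union lets consuming code narrow on `type` before reading type-specific fields.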
## Expected Outcomes

Expected outcomes describe behavior, not exact outputs:
```yaml
# Good: Behavioral expectations
expected_outcomes:
  - "Agent should analyze the error logs first"
  - "Agent should query connection pool metrics"
  - "Agent should identify pool exhaustion as root cause"
  - "Agent should recommend increasing pool size"
  - "Agent should NOT make more than 5 tool calls"
```
```yaml
# Bad: Exact output matching
expected_outcomes:
  - "The database connection pool is exhausted"  # Too specific
```

Tip: Write expected outcomes like you're describing the ideal agent behavior to a colleague. Focus on what should happen, not the exact words.
## Storage and Versioning

### OpenSearch Storage

Test cases are stored in the `evals_test_cases` index:
```json
{
  "id": "tc-001",
  "version": 3,
  "name": "Database Connection Timeout",
  "initialPrompt": "Why is my API returning 503 errors?",
  "expectedOutcomes": ["..."],
  "labels": [
    { "key": "category", "value": "RCA" },
    { "key": "difficulty", "value": "Medium" }
  ],
  "createdAt": "2024-01-15T10:00:00Z",
  "updatedAt": "2024-01-20T14:30:00Z"
}
```
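Since the index is plain OpenSearch, documents can be retrieved with a standard `_search` request. The query below is a sketch under assumptions: it filters by a label key/value pair, and it assumes `labels` is not mapped as a `nested` field (if it is, wrap the clauses in a `nested` query):

```typescript
// Hedged sketch: search evals_test_cases by label. The mapping of the
// `labels` field is an assumption; adjust the query to your mapping.
const query = {
  query: {
    bool: {
      must: [
        { term: { "labels.key": "category" } },
        { term: { "labels.value": "RCA" } },
      ],
    },
  },
};

// Example request against a local cluster (endpoint is illustrative):
// await fetch("http://localhost:9200/evals_test_cases/_search", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(query),
// });
```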
### Immutable Versioning

Every edit creates a new version, preserving history:

```
tc-001-v1 → tc-001-v2 → tc-001-v3 (current)
```

This ensures:
- Reproducibility: Re-run evaluations with exact same test case
- Audit Trail: Track changes over time
- Comparison: Compare results across versions
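These guarantees follow from never mutating a stored test case: an edit produces a new document with the version incremented. A minimal sketch of that flow (the `editTestCase` helper is hypothetical, not part of the AgentHealth API):

```typescript
interface VersionedTestCase {
  id: string;
  version: number;
  name: string;
}

// Hypothetical helper: returns a new version instead of mutating.
// `id` and `version` are pinned so `changes` cannot override them.
function editTestCase(
  current: VersionedTestCase,
  changes: Partial<VersionedTestCase>
): VersionedTestCase {
  return { ...current, ...changes, id: current.id, version: current.version + 1 };
}

const v1 = { id: "tc-001", version: 1, name: "Database Connection Timeout" };
const v2 = editTestCase(v1, { name: "DB Connection Timeout (refined)" });
// v1 is untouched; v2 carries the new name and version 2.
```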
## Labels and Organization

The unified labeling system helps organize datasets:

### Standard Labels
Section titled “Standard Labels”| Label Key | Values | Purpose |
|---|---|---|
category | RCA, Query, Conversation | Use case type |
difficulty | Easy, Medium, Hard | Complexity level |
domain | kubernetes, databases, networking | Technical domain |
framework | langgraph, strands, custom | Target framework |
priority | P0, P1, P2 | Regression priority |
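Label-based filtering can be illustrated with a small client-side sketch; the `matchesLabels` helper below is hypothetical, shown only to make the key/value semantics concrete:

```typescript
interface Label { key: string; value: string }
interface LabeledCase { id: string; labels: Label[] }

// A case matches when every required key/value pair appears in its labels.
function matchesLabels(tc: LabeledCase, required: Label[]): boolean {
  return required.every(r =>
    tc.labels.some(l => l.key === r.key && l.value === r.value)
  );
}

const cases: LabeledCase[] = [
  { id: "tc-001", labels: [{ key: "category", value: "RCA" }, { key: "difficulty", value: "Medium" }] },
  { id: "tc-002", labels: [{ key: "category", value: "Query" }] },
];

const rcaCases = cases.filter(tc => matchesLabels(tc, [{ key: "category", value: "RCA" }]));
// rcaCases contains only tc-001
```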
### Custom Labels

Add any labels relevant to your organization:

```typescript
labels: [
  { key: "team", value: "platform" },
  { key: "sprint", value: "2024-Q1" },
  { key: "customer", value: "enterprise" }
]
```

## Comparison with Alternatives

How AgentHealth datasets compare to other platforms:
| Feature | AgentHealth | Braintrust | Arize | Langfuse |
|---|---|---|---|---|
| Storage | OpenSearch (self-hosted) | Cloud | Cloud | Postgres |
| Versioning | Immutable versions | Git-like | Snapshots | Timestamps |
| Context Types | Logs, metrics, docs, custom | JSON | Spans | JSON |
| Expected Outcomes | Behavioral descriptions | Scorers | Ground truth | Manual |
| Labeling | Unified key-value | Tags | Metadata | Tags |
| Export | OpenSearch API | API | API | API |
### Key Differentiator

AgentHealth focuses on trajectory evaluation rather than output matching. This means:
- Test cases describe expected behavior, not exact outputs
- LLM Judge evaluates the entire reasoning process
- Improvement strategies are generated automatically
## Next Steps

### Create a Dataset

Step-by-step guide to creating test cases →

### Auto-Add to Dataset

Automatically capture production traces as test cases →