Datasets Overview

Datasets are collections of test cases used to evaluate AI agent performance. Unlike traditional unit tests, AgentHealth datasets capture complex scenarios with context, tools, and expected behaviors.

A dataset in AgentHealth consists of:

  • Test Cases: Individual evaluation scenarios
  • Context Items: Supporting data (logs, metrics, docs)
  • Expected Outcomes: Behavioral expectations for the agent
  • Labels: Organization and filtering metadata

```
Dataset: "RCA Evaluation Suite"
├── Test Case: "Database Connection Timeout"
│   ├── Context: Error logs, metrics data
│   ├── Tools: log_fetcher, metrics_query
│   └── Expected: Identify connection pool issue
├── Test Case: "Memory Leak Detection"
│   ├── Context: Memory graphs, heap dumps
│   ├── Tools: metrics_query, heap_analyzer
│   └── Expected: Identify leak source
└── Test Case: "Network Latency Spike"
    ├── Context: Network traces, latency data
    ├── Tools: trace_analyzer, ping_test
    └── Expected: Identify routing issue
```

Each test case defines a complete evaluation scenario:

```typescript
interface TestCase {
  // Identification
  id: string;
  name: string;
  description?: string;

  // Evaluation Input
  initialPrompt: string;        // The question/task for the agent
  context: AgentContextItem[];  // Supporting information
  tools: ToolDefinition[];      // Available tools

  // Expected Behavior
  expectedOutcomes: string[];   // What should happen

  // Organization
  labels: Label[];              // Tags for filtering
  version: number;              // Immutable versioning
}
```
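
A concrete instance might look like the following. This is a minimal sketch: the `AgentContextItem`, `ToolDefinition`, and `Label` shapes here are simplified stand-ins, not the exact AgentHealth types.

```typescript
// Simplified stand-ins for the referenced types (assumed shapes).
type AgentContextItem = { type: string; source: string; content: unknown };
type ToolDefinition = { name: string; description: string };
type Label = { key: string; value: string };

interface TestCase {
  id: string;
  name: string;
  description?: string;
  initialPrompt: string;
  context: AgentContextItem[];
  tools: ToolDefinition[];
  expectedOutcomes: string[];
  labels: Label[];
  version: number;
}

const dbTimeoutCase: TestCase = {
  id: "tc-001",
  name: "Database Connection Timeout",
  initialPrompt: "Why is my API returning 503 errors?",
  context: [
    { type: "log", source: "api-server", content: "ERROR Connection timeout to database" },
  ],
  tools: [
    { name: "log_fetcher", description: "Fetch recent service logs" },
    { name: "metrics_query", description: "Query time-series metrics" },
  ],
  expectedOutcomes: ["Agent should identify pool exhaustion as root cause"],
  labels: [{ key: "category", value: "RCA" }],
  version: 1,
};
```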

Context provides information the agent needs to complete the task:

```typescript
{
  type: "log",
  source: "api-server",
  content: `
2024-01-15 10:23:45 ERROR Connection timeout to database
2024-01-15 10:23:46 ERROR Failed to acquire connection from pool
2024-01-15 10:23:47 WARN Connection pool exhausted (50/50)
  `,
  metadata: {
    timeRange: "last 5 minutes",
    level: "ERROR"
  }
}
```

```typescript
{
  type: "metric",
  source: "prometheus",
  content: {
    name: "db_connection_pool_size",
    values: [
      { timestamp: "10:20:00", value: 45 },
      { timestamp: "10:21:00", value: 48 },
      { timestamp: "10:22:00", value: 50 },
      { timestamp: "10:23:00", value: 50 }
    ]
  }
}
```

```typescript
{
  type: "document",
  source: "runbook",
  content: `
## Database Connection Pool Troubleshooting
If connection pool is exhausted:
1. Check for connection leaks in application code
2. Increase pool size in config (max 100)
3. Review query execution times
  `
}
```
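
When a harness consumes a test case, it often needs to route these heterogeneous items by type. A hypothetical helper (not part of the AgentHealth API) could group them like this:

```typescript
// Hypothetical helper: bucket a test case's context items by their `type`
// field so an evaluation harness can hand logs, metrics, and documents
// to the agent through the appropriate channel.
type AgentContextItem = { type: string; source: string; content: unknown };

function groupContextByType(items: AgentContextItem[]): Map<string, AgentContextItem[]> {
  const groups = new Map<string, AgentContextItem[]>();
  for (const item of items) {
    const bucket = groups.get(item.type) ?? [];
    bucket.push(item);
    groups.set(item.type, bucket);
  }
  return groups;
}

const grouped = groupContextByType([
  { type: "log", source: "api-server", content: "..." },
  { type: "metric", source: "prometheus", content: "..." },
  { type: "log", source: "db", content: "..." },
]);
// grouped.get("log") holds two items; grouped.get("metric") holds one
```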

Expected outcomes describe behavior, not exact outputs:

```yaml
# Good: Behavioral expectations
expected_outcomes:
  - "Agent should analyze the error logs first"
  - "Agent should query connection pool metrics"
  - "Agent should identify pool exhaustion as root cause"
  - "Agent should recommend increasing pool size"
  - "Agent should NOT make more than 5 tool calls"

# Bad: Exact output matching
expected_outcomes:
  - "The database connection pool is exhausted"  # Too specific
```
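
Some behavioral outcomes, like the tool-call budget above, can be checked mechanically without an LLM judge. A sketch, assuming a hypothetical recorded-trajectory shape:

```typescript
// Sketch: verify a bounded-tool-call expectation against a recorded
// trajectory. The TrajectoryStep shape is an assumption for illustration.
type TrajectoryStep = { kind: "tool_call" | "message"; name?: string };

function withinToolCallBudget(steps: TrajectoryStep[], max: number): boolean {
  const calls = steps.filter((s) => s.kind === "tool_call").length;
  return calls <= max;
}

const trajectory: TrajectoryStep[] = [
  { kind: "tool_call", name: "log_fetcher" },
  { kind: "tool_call", name: "metrics_query" },
  { kind: "message" },
];
// withinToolCallBudget(trajectory, 5) → true (2 calls, budget of 5)
```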

Tip: Write expected outcomes like you’re describing the ideal agent behavior to a colleague. Focus on what should happen, not the exact words.

Test cases are stored in the `evals_test_cases` index:

```json
{
  "id": "tc-001",
  "version": 3,
  "name": "Database Connection Timeout",
  "initialPrompt": "Why is my API returning 503 errors?",
  "expectedOutcomes": ["..."],
  "labels": [
    { "key": "category", "value": "RCA" },
    { "key": "difficulty", "value": "Medium" }
  ],
  "createdAt": "2024-01-15T10:00:00Z",
  "updatedAt": "2024-01-20T14:30:00Z"
}
```

Every edit creates a new version, preserving history:

tc-001-v1 → tc-001-v2 → tc-001-v3 (current)

This ensures:

  • Reproducibility: Re-run evaluations with exact same test case
  • Audit Trail: Track changes over time
  • Comparison: Compare results across versions
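
The append-only behavior behind that chain can be sketched as follows. This is illustrative, not the actual storage code; `saveNewVersion` and the `Versioned` wrapper are hypothetical names.

```typescript
// Sketch of immutable versioning: an edit never mutates an existing
// version, it appends a new one with the next version number.
type Versioned<T> = T & { version: number };

function saveNewVersion<T>(history: Versioned<T>[], edit: T): Versioned<T>[] {
  const next = (history[history.length - 1]?.version ?? 0) + 1;
  // Earlier versions are preserved untouched, so old runs stay reproducible.
  return [...history, { ...edit, version: next }];
}

let history: Versioned<{ name: string }>[] = [];
history = saveNewVersion(history, { name: "Database Connection Timeout" });
history = saveNewVersion(history, { name: "Database Connection Timeout (tightened outcomes)" });
// history[0] is still version 1; history[1] is version 2, the current one
```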

The unified labeling system helps organize datasets:

| Label Key | Values | Purpose |
|-----------|--------|---------|
| `category` | RCA, Query, Conversation | Use case type |
| `difficulty` | Easy, Medium, Hard | Complexity level |
| `domain` | kubernetes, databases, networking | Technical domain |
| `framework` | langgraph, strands, custom | Target framework |
| `priority` | P0, P1, P2 | Regression priority |

Add any labels relevant to your organization:

```typescript
labels: [
  { key: "team", value: "platform" },
  { key: "sprint", value: "2024-Q1" },
  { key: "customer", value: "enterprise" }
]
```
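
Selecting a dataset subset by labels then reduces to a key/value match. A hypothetical filter (illustrative names, not the AgentHealth API):

```typescript
// Hypothetical label filter: keep test cases whose labels satisfy
// every requested key/value pair.
type Label = { key: string; value: string };
type Labeled = { id: string; labels: Label[] };

function matchesLabels(item: Labeled, filter: Label[]): boolean {
  return filter.every((f) =>
    item.labels.some((l) => l.key === f.key && l.value === f.value)
  );
}

const cases: Labeled[] = [
  { id: "tc-001", labels: [{ key: "category", value: "RCA" }, { key: "difficulty", value: "Medium" }] },
  { id: "tc-002", labels: [{ key: "category", value: "Query" }] },
];

const rcaCases = cases.filter((c) => matchesLabels(c, [{ key: "category", value: "RCA" }]));
// rcaCases contains only tc-001
```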

How AgentHealth datasets compare to other platforms:

| Feature | AgentHealth | Braintrust | Arize | Langfuse |
|---------|-------------|------------|-------|----------|
| Storage | OpenSearch (self-hosted) | Cloud | Cloud | Postgres |
| Versioning | Immutable versions | Git-like | Snapshots | Timestamps |
| Context Types | Logs, metrics, docs, custom | JSON | Spans | JSON |
| Expected Outcomes | Behavioral descriptions | Scorers | Ground truth | Manual |
| Labeling | Unified key-value | Tags | Metadata | Tags |
| Export | OpenSearch API | API | API | API |

AgentHealth focuses on trajectory evaluation rather than output matching. This means:

  • Test cases describe expected behavior, not exact outputs
  • LLM Judge evaluates the entire reasoning process
  • Improvement strategies are generated automatically

Step-by-step guide to creating test cases →

Automatically capture production traces as test cases →

Use datasets in benchmark evaluations →