Datasets Overview

Datasets are collections of test cases used to evaluate AI agent performance. Unlike traditional unit tests, AgentHealth datasets capture complex scenarios with context, tools, and expected behaviors.

A dataset in AgentHealth consists of:

  • Test Cases: Individual evaluation scenarios
  • Context Items: Supporting data (logs, metrics, docs)
  • Expected Outcomes: Behavioral expectations for the agent
  • Labels: Organization and filtering metadata

```
Dataset: "RCA Evaluation Suite"
├── Test Case: "Database Connection Timeout"
│   ├── Context: Error logs, metrics data
│   ├── Tools: log_fetcher, metrics_query
│   └── Expected: Identify connection pool issue
├── Test Case: "Memory Leak Detection"
│   ├── Context: Memory graphs, heap dumps
│   ├── Tools: metrics_query, heap_analyzer
│   └── Expected: Identify leak source
└── Test Case: "Network Latency Spike"
    ├── Context: Network traces, latency data
    ├── Tools: trace_analyzer, ping_test
    └── Expected: Identify routing issue
```

Each test case defines a complete evaluation scenario:

```typescript
interface TestCase {
  // Identification
  id: string;
  name: string;
  description?: string;

  // Evaluation Input
  initialPrompt: string;        // The question/task for the agent
  context: AgentContextItem[];  // Supporting information
  tools: ToolDefinition[];      // Available tools

  // Expected Behavior
  expectedOutcomes: string[];   // What should happen

  // Organization
  labels: Label[];              // Tags for filtering
  version: number;              // Immutable versioning
}
```
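
A concrete instance might look like the following. This is a minimal sketch: the `AgentContextItem`, `ToolDefinition`, and `Label` shapes here are simplified stand-ins, not the exact AgentHealth types.

```typescript
// Simplified stand-ins for the referenced types (assumed shapes).
type AgentContextItem = { type: string; source: string; content: unknown };
type ToolDefinition = { name: string; description: string };
type Label = { key: string; value: string };

interface TestCase {
  id: string;
  name: string;
  description?: string;
  initialPrompt: string;
  context: AgentContextItem[];
  tools: ToolDefinition[];
  expectedOutcomes: string[];
  labels: Label[];
  version: number;
}

const dbTimeoutCase: TestCase = {
  id: "tc-001",
  name: "Database Connection Timeout",
  initialPrompt: "Why is my API returning 503 errors?",
  context: [
    { type: "log", source: "api-server", content: "ERROR Connection timeout to database" },
  ],
  tools: [
    { name: "log_fetcher", description: "Fetch recent service logs" },
    { name: "metrics_query", description: "Query time-series metrics" },
  ],
  expectedOutcomes: ["Agent should identify pool exhaustion as root cause"],
  labels: [{ key: "category", value: "RCA" }],
  version: 1,
};
```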

Context provides information the agent needs to complete the task:

```typescript
{
  type: "log",
  source: "api-server",
  content: `
2024-01-15 10:23:45 ERROR Connection timeout to database
2024-01-15 10:23:46 ERROR Failed to acquire connection from pool
2024-01-15 10:23:47 WARN Connection pool exhausted (50/50)
  `,
  metadata: {
    timeRange: "last 5 minutes",
    level: "ERROR"
  }
}
```

```typescript
{
  type: "metric",
  source: "prometheus",
  content: {
    name: "db_connection_pool_size",
    values: [
      { timestamp: "10:20:00", value: 45 },
      { timestamp: "10:21:00", value: 48 },
      { timestamp: "10:22:00", value: 50 },
      { timestamp: "10:23:00", value: 50 }
    ]
  }
}
```

```typescript
{
  type: "document",
  source: "runbook",
  content: `
## Database Connection Pool Troubleshooting
If connection pool is exhausted:
1. Check for connection leaks in application code
2. Increase pool size in config (max 100)
3. Review query execution times
  `
}
```
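
When a harness consumes a test case, it often needs to route these heterogeneous items by type. A hypothetical helper (not part of the AgentHealth API) could group them like this:

```typescript
// Hypothetical helper: bucket a test case's context items by their `type`
// field so an evaluation harness can hand logs, metrics, and documents
// to the agent through the appropriate channel.
type AgentContextItem = { type: string; source: string; content: unknown };

function groupContextByType(items: AgentContextItem[]): Map<string, AgentContextItem[]> {
  const groups = new Map<string, AgentContextItem[]>();
  for (const item of items) {
    const bucket = groups.get(item.type) ?? [];
    bucket.push(item);
    groups.set(item.type, bucket);
  }
  return groups;
}

const grouped = groupContextByType([
  { type: "log", source: "api-server", content: "..." },
  { type: "metric", source: "prometheus", content: "..." },
  { type: "log", source: "db", content: "..." },
]);
// grouped.get("log") holds two items; grouped.get("metric") holds one
```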

Expected outcomes describe behavior, not exact outputs:

```yaml
# Good: Behavioral expectations
expected_outcomes:
  - "Agent should analyze the error logs first"
  - "Agent should query connection pool metrics"
  - "Agent should identify pool exhaustion as root cause"
  - "Agent should recommend increasing pool size"
  - "Agent should NOT make more than 5 tool calls"

# Bad: Exact output matching
expected_outcomes:
  - "The database connection pool is exhausted"  # Too specific
```
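
Some behavioral outcomes, like the tool-call budget above, can be checked mechanically without an LLM judge. A sketch, assuming a hypothetical recorded-trajectory shape:

```typescript
// Sketch: verify a bounded-tool-call expectation against a recorded
// trajectory. The TrajectoryStep shape is an assumption for illustration.
type TrajectoryStep = { kind: "tool_call" | "message"; name?: string };

function withinToolCallBudget(steps: TrajectoryStep[], max: number): boolean {
  const calls = steps.filter((s) => s.kind === "tool_call").length;
  return calls <= max;
}

const trajectory: TrajectoryStep[] = [
  { kind: "tool_call", name: "log_fetcher" },
  { kind: "tool_call", name: "metrics_query" },
  { kind: "message" },
];
// withinToolCallBudget(trajectory, 5) → true (2 calls, budget of 5)
```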

Tip: Write expected outcomes like you’re describing the ideal agent behavior to a colleague. Focus on what should happen, not the exact words.

Test cases are stored in the `evals_test_cases` index:

```json
{
  "id": "tc-001",
  "version": 3,
  "name": "Database Connection Timeout",
  "initialPrompt": "Why is my API returning 503 errors?",
  "expectedOutcomes": ["..."],
  "labels": [
    { "key": "category", "value": "RCA" },
    { "key": "difficulty", "value": "Medium" }
  ],
  "createdAt": "2024-01-15T10:00:00Z",
  "updatedAt": "2024-01-20T14:30:00Z"
}
```

Every edit creates a new version, preserving history:

tc-001-v1 → tc-001-v2 → tc-001-v3 (current)

This ensures:

  • Reproducibility: Re-run evaluations with exact same test case
  • Audit Trail: Track changes over time
  • Comparison: Compare results across versions
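
The append-only behavior behind that chain can be sketched as follows. This is illustrative, not the actual storage code; `saveNewVersion` and the `Versioned` wrapper are hypothetical names.

```typescript
// Sketch of immutable versioning: an edit never mutates an existing
// version, it appends a new one with the next version number.
type Versioned<T> = T & { version: number };

function saveNewVersion<T>(history: Versioned<T>[], edit: T): Versioned<T>[] {
  const next = (history[history.length - 1]?.version ?? 0) + 1;
  // Earlier versions are preserved untouched, so old runs stay reproducible.
  return [...history, { ...edit, version: next }];
}

let history: Versioned<{ name: string }>[] = [];
history = saveNewVersion(history, { name: "Database Connection Timeout" });
history = saveNewVersion(history, { name: "Database Connection Timeout (tightened outcomes)" });
// history[0] is still version 1; history[1] is version 2, the current one
```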

The unified labeling system helps organize datasets:

| Label Key | Values | Purpose |
|-----------|--------|---------|
| `category` | RCA, Query, Conversation | Use case type |
| `difficulty` | Easy, Medium, Hard | Complexity level |
| `domain` | kubernetes, databases, networking | Technical domain |
| `framework` | langgraph, strands, custom | Target framework |
| `priority` | P0, P1, P2 | Regression priority |

Add any labels relevant to your organization:

```typescript
labels: [
  { key: "team", value: "platform" },
  { key: "sprint", value: "2024-Q1" },
  { key: "customer", value: "enterprise" }
]
```
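
Selecting a dataset subset by labels then reduces to a key/value match. A hypothetical filter (illustrative names, not the AgentHealth API):

```typescript
// Hypothetical label filter: keep test cases whose labels satisfy
// every requested key/value pair.
type Label = { key: string; value: string };
type Labeled = { id: string; labels: Label[] };

function matchesLabels(item: Labeled, filter: Label[]): boolean {
  return filter.every((f) =>
    item.labels.some((l) => l.key === f.key && l.value === f.value)
  );
}

const cases: Labeled[] = [
  { id: "tc-001", labels: [{ key: "category", value: "RCA" }, { key: "difficulty", value: "Medium" }] },
  { id: "tc-002", labels: [{ key: "category", value: "Query" }] },
];

const rcaCases = cases.filter((c) => matchesLabels(c, [{ key: "category", value: "RCA" }]));
// rcaCases contains only tc-001
```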

How AgentHealth datasets compare to other platforms:

| Feature | AgentHealth | Braintrust | Arize | Langfuse |
|---------|-------------|------------|-------|----------|
| Storage | OpenSearch (self-hosted) | Cloud | Cloud | Postgres |
| Versioning | Immutable versions | Git-like | Snapshots | Timestamps |
| Context Types | Logs, metrics, docs, custom | JSON | Spans | JSON |
| Expected Outcomes | Behavioral descriptions | Scorers | Ground truth | Manual |
| Labeling | Unified key-value | Tags | Metadata | Tags |
| Export | OpenSearch API | API | API | API |

AgentHealth focuses on trajectory evaluation rather than output matching. This means:

  • Test cases describe expected behavior, not exact outputs
  • LLM Judge evaluates the entire reasoning process
  • Improvement strategies are generated automatically

Step-by-step guide to creating test cases →

Automatically capture production traces as test cases →

Use datasets in benchmark evaluations →