Core Concepts

This guide covers the fundamental concepts behind OpenSearch AgentHealth—test cases, benchmarks, traces, and trajectories—that you’ll encounter when building observable AI agents.

Test Cases

A test case defines a scenario to evaluate your agent. Each test case includes:

  • Initial prompt: The question or task for the agent
  • Context: Supporting data like logs, metrics, or documentation
  • Available tools: What the agent can use
  • Expected outcomes: Descriptions of expected behavior

Context provides the agent with information it needs:

  • Logs: Application logs, error messages
  • Metrics: Performance data, dashboards
  • Documentation: Runbooks, architecture docs
  • Previous Conversations: Multi-turn context

Unlike exact-match testing, expected outcomes describe what should happen:

- Agent should identify the database connection timeout
- Agent should query the metrics service for latency data
- Agent should NOT make more than 3 tool calls
- Agent should recommend increasing connection pool size
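
Putting the pieces together, a test case might look like the following sketch. This is illustrative only: the field names are assumptions, not AgentHealth's actual schema.

```python
# Hypothetical test-case structure; field names are illustrative,
# not AgentHealth's actual schema.
test_case = {
    "initial_prompt": "Why is the checkout service returning 500 errors?",
    "context": {
        "logs": ["ERROR: connection timeout to db-primary:5432"],
        "metrics": ["p99 latency: 4.2s"],
        "documentation": ["runbook for checkout-service incidents"],
    },
    "available_tools": ["query_logs", "query_metrics"],
    "expected_outcomes": [
        "Agent should identify the database connection timeout",
        "Agent should NOT make more than 3 tool calls",
    ],
}

# Sanity check: all four sections described above are present.
required = {"initial_prompt", "context", "available_tools", "expected_outcomes"}
assert required <= test_case.keys()
```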

The AI Judge evaluates whether the agent’s actual trajectory aligns with these expectations.

Benchmarks

A benchmark is a collection of test cases grouped for evaluation. Use benchmarks to:

  • Group related test cases (e.g., “RCA Scenarios”, “Query Generation”)
  • Create regression test suites
  • Compare agent performance across versions
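
A benchmark, then, is little more than a named, labeled set of test-case references. The structure below is a hypothetical sketch (the field names are assumptions):

```python
# Hypothetical benchmark definition; field names are illustrative.
benchmark = {
    "name": "RCA Scenarios",
    "labels": {"category": "RCA", "difficulty": "Medium"},
    "test_case_ids": ["tc-001", "tc-002", "tc-003"],  # references, not copies
}

# Running the same benchmark against two agent versions enables
# regression comparison across identical scenarios.
assert len(benchmark["test_case_ids"]) == 3
```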

Traces

Traces are the foundation of observability. A trace represents a complete execution of your agent, from the initial prompt to the final response.

Note: OpenSearch AgentHealth is built on OpenTelemetry, making it compatible with any OTEL-instrumented agent framework.

Each span in a trace is automatically categorized:

| Category | Description |
|---|---|
| AGENT | Root agent operation |
| LLM | Language model inference |
| TOOL | Tool/function invocation |
| RETRIEVAL | Vector/data retrieval |

OpenTelemetry GenAI semantic conventions capture:

  • Operation name (e.g., “invoke_agent”)
  • System (e.g., “claude”, “bedrock”)
  • Model ID
  • Tool being invoked
  • Input/output token counts
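
On a span, these appear as `gen_ai.*` attributes. The keys below follow recent versions of the OpenTelemetry GenAI semantic conventions; check the semconv release you depend on, as the exact names have evolved.

```python
# Example attribute set for an agent-invocation span, using keys from
# the OpenTelemetry GenAI semantic conventions (verify against your
# semconv version; names have changed across releases).
span_attributes = {
    "gen_ai.operation.name": "invoke_agent",
    "gen_ai.system": "bedrock",
    "gen_ai.request.model": "anthropic.claude-3-sonnet",  # illustrative model ID
    "gen_ai.tool.name": "query_metrics",
    "gen_ai.usage.input_tokens": 812,
    "gen_ai.usage.output_tokens": 164,
}

total_tokens = (span_attributes["gen_ai.usage.input_tokens"]
                + span_attributes["gen_ai.usage.output_tokens"])
```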

Trajectories

A trajectory is the sequence of steps an agent takes to complete a task. Unlike simple output comparison, trajectory analysis captures:

  • Thoughts: The agent’s reasoning process
  • Actions: Decisions to use tools or generate responses
  • Tool Calls: Which tools were invoked and with what parameters
  • Results: Outcomes from each tool invocation
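
A minimal data model for the four elements above might look like this sketch (a hypothetical structure, not AgentHealth's internal representation):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical trajectory step; fields mirror the four elements above.
@dataclass
class TrajectoryStep:
    thought: str                      # the agent's reasoning at this step
    action: str                       # "tool_call" or "respond"
    tool_call: Optional[dict] = None  # which tool, with what parameters
    result: Optional[str] = None      # outcome of the tool invocation

trajectory = [
    TrajectoryStep("Error log points at the database", "tool_call",
                   {"name": "query_metrics", "params": {"service": "db"}},
                   "p99 latency 4.2s"),
    TrajectoryStep("Latency confirms a connection bottleneck", "respond"),
]

tool_calls = [s for s in trajectory if s.action == "tool_call"]
```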

Most evaluation tools only check if the agent got the “right answer.” But two agents can reach the same conclusion through very different paths:

Agent A (Optimal)

  1. Analyze error log
  2. Query metrics database
  3. Identify root cause
  4. Recommend fix

Agent B (Inefficient)

  1. Query 5 different databases
  2. Generate speculative hypotheses
  3. Make redundant API calls
  4. Eventually reach same conclusion

Trajectory evaluation reveals these differences, helping you optimize agent behavior.
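
One concrete signal a trajectory exposes that output comparison cannot is redundancy in tool usage. A toy metric, purely for illustration:

```python
# Both agents reach the same conclusion; only their tool-call paths differ.
agent_a_calls = ["analyze_error_log", "query_metrics"]
agent_b_calls = ["query_db1", "query_db2", "query_db3",
                 "query_db4", "query_db5", "query_db1"]  # query_db1 repeated

def redundant_calls(calls):
    """Count invocations beyond the first use of each tool (toy metric)."""
    return len(calls) - len(set(calls))

# Agent A makes no redundant calls; Agent B repeats one.
assert redundant_calls(agent_a_calls) == 0
assert redundant_calls(agent_b_calls) == 1
```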

The AI Judge

The AI Judge is an AI-powered evaluator that scores agent performance:

| Metric | Range | Description |
|---|---|---|
| Accuracy | 0-100 | Did the agent reach the correct conclusion? |
| Faithfulness | 0-100 | Did responses stay grounded in provided context? |
| Trajectory Alignment | 0-100 | Did steps match expected flow? |
| Efficiency | Qualitative | Were tool calls and reasoning optimal? |
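
A judge result combining the numeric scores with the qualitative efficiency note might be shaped like this (a hypothetical sketch; key names mirror the table, not AgentHealth's actual output format):

```python
# Hypothetical AI Judge result; key names are illustrative.
judge_result = {
    "accuracy": 92,
    "faithfulness": 88,
    "trajectory_alignment": 75,
    "efficiency": "Redundant metrics query in step 3",  # qualitative
    "recommendations": {
        "tool_usage": ["Cache the first query_metrics result"],
    },
}

# Numeric metrics are on a 0-100 scale; efficiency stays qualitative.
numeric_scores = {k: v for k, v in judge_result.items() if isinstance(v, int)}
```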

The judge provides categorized recommendations across:

  • Tool usage - Optimizing how tools are called
  • Reasoning - Improving decision-making logic
  • Context - Better use of available information
  • Efficiency - Reducing redundant operations

Labels

Use labels to organize test cases and benchmarks with key-value pairs:

  • category:RCA - Root cause analysis
  • difficulty:Easy|Medium|Hard
  • domain:kubernetes|databases|networking
  • framework:langgraph|strands|custom
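
Because labels are plain key-value pairs, filtering is straightforward. A minimal sketch (the filter helper is illustrative, not an AgentHealth API):

```python
# Hypothetical label filtering over test cases; `match` is illustrative.
test_cases = [
    {"id": "tc-1", "labels": {"category": "RCA", "difficulty": "Hard"}},
    {"id": "tc-2", "labels": {"category": "RCA", "difficulty": "Easy"}},
    {"id": "tc-3", "labels": {"domain": "kubernetes"}},
]

def match(case, **wanted):
    """True when every requested label is present with the given value."""
    return all(case["labels"].get(k) == v for k, v in wanted.items())

hard_rca = [c["id"] for c in test_cases
            if match(c, category="RCA", difficulty="Hard")]
```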

Adapters

AgentHealth supports multiple agent frameworks through adapters:

| Adapter | Protocol | Use Case |
|---|---|---|
| AG-UI | SSE Streaming | Real-time trajectory capture |
| Claude Code | CLI + JSON Events | Claude Code CLI agents |
| LangGraph | REST API | LangGraph deployments |
| Strands | AWS Bedrock Runtime | Strands framework |
| Custom | User-defined | Any agent implementation |

Data Flow

Data flows through AgentHealth in five stages:

  1. Test Case Created → Stored in OpenSearch
  2. Benchmark Defined → References test cases
  3. Run Executed → Agent invoked, trajectory captured, OTEL spans collected, AI Judge evaluates
  4. Results Stored → Run results indexed
  5. Analysis → Compare runs, track metrics, generate insights
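
The run stage (step 3) can be sketched as a simple pipeline. Everything here is hypothetical: the function names and signatures are assumptions standing in for AgentHealth's actual execution path.

```python
# Hypothetical run pipeline; names and signatures are illustrative,
# not AgentHealth's API.
def run_benchmark(benchmark, agent, judge, store):
    results = []
    for case in benchmark["test_cases"]:
        trajectory = agent(case["initial_prompt"])   # agent invoked, spans collected
        evaluation = judge(case, trajectory)         # AI Judge evaluates
        store.append({"case": case["id"], **evaluation})  # run results indexed
        results.append(evaluation)
    return results

# Stub dependencies, for illustration only:
store = []
bench = {"test_cases": [{"id": "tc-1", "initial_prompt": "Diagnose 500s"}]}
out = run_benchmark(bench,
                    agent=lambda prompt: ["step"],
                    judge=lambda case, traj: {"accuracy": 90},
                    store=store)
```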