Core Concepts

This guide covers the fundamental concepts behind OpenSearch AgentHealth—test cases, benchmarks, traces, and trajectories—that you’ll encounter when building observable AI agents.

Test Cases

A test case defines a scenario to evaluate your agent. Each test case includes:

  • Initial prompt: The question or task for the agent
  • Context: Supporting data like logs, metrics, or documentation
  • Available tools: What the agent can use
  • Expected outcomes: Descriptions of expected behavior

Context provides the agent with information it needs:

  • Logs: Application logs, error messages
  • Metrics: Performance data, dashboards
  • Documentation: Runbooks, architecture docs
  • Previous Conversations: Multi-turn context

Unlike exact-match testing, expected outcomes describe what should happen:

- Agent should identify the database connection timeout
- Agent should query the metrics service for latency data
- Agent should NOT make more than 3 tool calls
- Agent should recommend increasing connection pool size
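
Putting the pieces together, a test case might look like the following sketch. This is illustrative only: the field names are assumptions, not AgentHealth's actual schema.

```python
# Hypothetical test-case structure; field names are illustrative,
# not AgentHealth's actual schema.
test_case = {
    "initial_prompt": "Why is the checkout service returning 500 errors?",
    "context": {
        "logs": ["ERROR: connection timeout to db-primary:5432"],
        "metrics": ["p99 latency: 4.2s"],
        "documentation": ["runbook for checkout-service incidents"],
    },
    "available_tools": ["query_logs", "query_metrics"],
    "expected_outcomes": [
        "Agent should identify the database connection timeout",
        "Agent should NOT make more than 3 tool calls",
    ],
}

# Sanity check: all four sections described above are present.
required = {"initial_prompt", "context", "available_tools", "expected_outcomes"}
assert required <= test_case.keys()
```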

The AI Judge evaluates whether the agent’s actual trajectory aligns with these expectations.

Benchmarks

A benchmark is a collection of test cases grouped for evaluation. Use benchmarks to:

  • Group related test cases (e.g., “RCA Scenarios”, “Query Generation”)
  • Create regression test suites
  • Compare agent performance across versions
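
A benchmark, then, is little more than a named, labeled set of test-case references. The structure below is a hypothetical sketch (the field names are assumptions):

```python
# Hypothetical benchmark definition; field names are illustrative.
benchmark = {
    "name": "RCA Scenarios",
    "labels": {"category": "RCA", "difficulty": "Medium"},
    "test_case_ids": ["tc-001", "tc-002", "tc-003"],  # references, not copies
}

# Running the same benchmark against two agent versions enables
# regression comparison across identical scenarios.
assert len(benchmark["test_case_ids"]) == 3
```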

Traces

Traces are the foundation of observability. A trace represents a complete execution of your agent, from the initial prompt to the final response.

Note: OpenSearch AgentHealth is built on OpenTelemetry, making it compatible with any OTEL-instrumented agent framework.

Each span in a trace is automatically categorized:

| Category | Description |
|---|---|
| AGENT | Root agent operation |
| LLM | Language model inference |
| TOOL | Tool/function invocation |
| RETRIEVAL | Vector/data retrieval |

OpenTelemetry GenAI semantic conventions capture:

  • Operation name (e.g., “invoke_agent”)
  • System (e.g., “claude”, “bedrock”)
  • Model ID
  • Tool being invoked
  • Input/output token counts
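
On a span, these appear as `gen_ai.*` attributes. The keys below follow recent versions of the OpenTelemetry GenAI semantic conventions; check the semconv release you depend on, as the exact names have evolved.

```python
# Example attribute set for an agent-invocation span, using keys from
# the OpenTelemetry GenAI semantic conventions (verify against your
# semconv version; names have changed across releases).
span_attributes = {
    "gen_ai.operation.name": "invoke_agent",
    "gen_ai.system": "bedrock",
    "gen_ai.request.model": "anthropic.claude-3-sonnet",  # illustrative model ID
    "gen_ai.tool.name": "query_metrics",
    "gen_ai.usage.input_tokens": 812,
    "gen_ai.usage.output_tokens": 164,
}

total_tokens = (span_attributes["gen_ai.usage.input_tokens"]
                + span_attributes["gen_ai.usage.output_tokens"])
```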

Trajectories

A trajectory is the sequence of steps an agent takes to complete a task. Unlike simple output comparison, trajectory analysis captures:

  • Thoughts: The agent’s reasoning process
  • Actions: Decisions to use tools or generate responses
  • Tool Calls: Which tools were invoked and with what parameters
  • Results: Outcomes from each tool invocation
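
A minimal data model for the four elements above might look like this sketch (a hypothetical structure, not AgentHealth's internal representation):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical trajectory step; fields mirror the four elements above.
@dataclass
class TrajectoryStep:
    thought: str                      # the agent's reasoning at this step
    action: str                       # "tool_call" or "respond"
    tool_call: Optional[dict] = None  # which tool, with what parameters
    result: Optional[str] = None      # outcome of the tool invocation

trajectory = [
    TrajectoryStep("Error log points at the database", "tool_call",
                   {"name": "query_metrics", "params": {"service": "db"}},
                   "p99 latency 4.2s"),
    TrajectoryStep("Latency confirms a connection bottleneck", "respond"),
]

tool_calls = [s for s in trajectory if s.action == "tool_call"]
```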

Most evaluation tools only check if the agent got the “right answer.” But two agents can reach the same conclusion through very different paths:

Agent A (Optimal)

  1. Analyze error log
  2. Query metrics database
  3. Identify root cause
  4. Recommend fix

Agent B (Inefficient)

  1. Query 5 different databases
  2. Generate speculative hypotheses
  3. Make redundant API calls
  4. Eventually reach same conclusion

Trajectory evaluation reveals these differences, helping you optimize agent behavior.
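
One concrete signal a trajectory exposes that output comparison cannot is redundancy in tool usage. A toy metric, purely for illustration:

```python
# Both agents reach the same conclusion; only their tool-call paths differ.
agent_a_calls = ["analyze_error_log", "query_metrics"]
agent_b_calls = ["query_db1", "query_db2", "query_db3",
                 "query_db4", "query_db5", "query_db1"]  # query_db1 repeated

def redundant_calls(calls):
    """Count invocations beyond the first use of each tool (toy metric)."""
    return len(calls) - len(set(calls))

# Agent A makes no redundant calls; Agent B repeats one.
assert redundant_calls(agent_a_calls) == 0
assert redundant_calls(agent_b_calls) == 1
```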

The AI Judge

The AI Judge is an AI-powered evaluator that scores agent performance:

| Metric | Range | Description |
|---|---|---|
| Accuracy | 0-100 | Did the agent reach the correct conclusion? |
| Faithfulness | 0-100 | Did responses stay grounded in provided context? |
| Trajectory Alignment | 0-100 | Did steps match expected flow? |
| Efficiency | Qualitative | Were tool calls and reasoning optimal? |
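
A judge result combining the numeric scores with the qualitative efficiency note might be shaped like this (a hypothetical sketch; key names mirror the table, not AgentHealth's actual output format):

```python
# Hypothetical AI Judge result; key names are illustrative.
judge_result = {
    "accuracy": 92,
    "faithfulness": 88,
    "trajectory_alignment": 75,
    "efficiency": "Redundant metrics query in step 3",  # qualitative
    "recommendations": {
        "tool_usage": ["Cache the first query_metrics result"],
    },
}

# Numeric metrics are on a 0-100 scale; efficiency stays qualitative.
numeric_scores = {k: v for k, v in judge_result.items() if isinstance(v, int)}
```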

The judge provides categorized recommendations across:

  • Tool usage - Optimizing how tools are called
  • Reasoning - Improving decision-making logic
  • Context - Better use of available information
  • Efficiency - Reducing redundant operations

Labels

Use labels to organize test cases and benchmarks with key-value pairs:

  • category:RCA - Root cause analysis
  • difficulty:Easy|Medium|Hard
  • domain:kubernetes|databases|networking
  • framework:langgraph|strands|custom
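
Because labels are plain key-value pairs, filtering is straightforward. A minimal sketch (the filter helper is illustrative, not an AgentHealth API):

```python
# Hypothetical label filtering over test cases; `match` is illustrative.
test_cases = [
    {"id": "tc-1", "labels": {"category": "RCA", "difficulty": "Hard"}},
    {"id": "tc-2", "labels": {"category": "RCA", "difficulty": "Easy"}},
    {"id": "tc-3", "labels": {"domain": "kubernetes"}},
]

def match(case, **wanted):
    """True when every requested label is present with the given value."""
    return all(case["labels"].get(k) == v for k, v in wanted.items())

hard_rca = [c["id"] for c in test_cases
            if match(c, category="RCA", difficulty="Hard")]
```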

Adapters

AgentHealth supports multiple agent frameworks through adapters:

| Adapter | Protocol | Use Case |
|---|---|---|
| AG-UI | SSE Streaming | Real-time trajectory capture |
| Claude Code | CLI + JSON Events | Claude Code CLI agents |
| LangGraph | REST API | LangGraph deployments |
| Strands | AWS Bedrock Runtime | Strands framework |
| Custom | User-defined | Any agent implementation |

Data Flow

Data flows through AgentHealth in five stages:

  1. Test Case Created → Stored in OpenSearch
  2. Benchmark Defined → References test cases
  3. Run Executed → Agent invoked, trajectory captured, OTEL spans collected, AI Judge evaluates
  4. Results Stored → Run results indexed
  5. Analysis → Compare runs, track metrics, generate insights
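
The run stage (step 3) can be sketched as a simple pipeline. Everything here is hypothetical: the function names and signatures are assumptions standing in for AgentHealth's actual execution path.

```python
# Hypothetical run pipeline; names and signatures are illustrative,
# not AgentHealth's API.
def run_benchmark(benchmark, agent, judge, store):
    results = []
    for case in benchmark["test_cases"]:
        trajectory = agent(case["initial_prompt"])   # agent invoked, spans collected
        evaluation = judge(case, trajectory)         # AI Judge evaluates
        store.append({"case": case["id"], **evaluation})  # run results indexed
        results.append(evaluation)
    return results

# Stub dependencies, for illustration only:
store = []
bench = {"test_cases": [{"id": "tc-1", "initial_prompt": "Diagnose 500s"}]}
out = run_benchmark(bench,
                    agent=lambda prompt: ["step"],
                    judge=lambda case, traj: {"accuracy": 90},
                    store=store)
```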