Core Concepts
This guide covers the fundamental concepts behind OpenSearch AgentHealth (test cases, benchmarks, traces, and trajectories) that you'll encounter when building observable AI agents.
Test Cases
A test case defines a scenario to evaluate your agent. Each test case includes:
- Initial prompt: The question or task for the agent
- Context: Supporting data like logs, metrics, or documentation
- Available tools: What the agent can use
- Expected outcomes: Descriptions of expected behavior
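A test case with these four parts can be sketched as a small data structure. This is illustrative only: the field names and shapes below are assumptions, not AgentHealth's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    """One evaluation scenario for an agent (field names are illustrative)."""
    initial_prompt: str            # the question or task for the agent
    context: list                  # supporting logs, metrics, or docs
    available_tools: list          # tools the agent may call
    expected_outcomes: list        # behavioral expectations, not exact answers

rca_case = TestCase(
    initial_prompt="Why are checkout requests timing out?",
    context=[{"type": "logs", "data": "ERROR: connection pool exhausted"}],
    available_tools=["query_metrics", "search_logs"],
    expected_outcomes=[
        "Agent should identify the database connection timeout",
        "Agent should NOT make more than 3 tool calls",
    ],
)
```

Note that the expected outcomes are free-text descriptions rather than exact-match strings; the AI Judge interprets them against the agent's behavior.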
Context Items
Context provides the agent with information it needs:
- Logs: Application logs, error messages
- Metrics: Performance data, dashboards
- Documentation: Runbooks, architecture docs
- Previous Conversations: Multi-turn context
Expected Outcomes
Unlike exact match testing, expected outcomes describe what should happen:
- Agent should identify the database connection timeout
- Agent should query the metrics service for latency data
- Agent should NOT make more than 3 tool calls
- Agent should recommend increasing connection pool size

The AI Judge evaluates whether the agent's actual trajectory aligns with these expectations.
Benchmarks
A benchmark is a collection of test cases grouped for evaluation. Use benchmarks to:
- Group related test cases (e.g., "RCA Scenarios", "Query Generation")
- Create regression test suites
- Compare agent performance across versions
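Conceptually, a benchmark is just a named group of test case references plus some organizational metadata. A minimal sketch, assuming hypothetical field names:

```python
from dataclasses import dataclass, field

@dataclass
class Benchmark:
    """A named group of test cases (structure is illustrative)."""
    name: str
    test_case_ids: list = field(default_factory=list)  # references, not copies
    labels: dict = field(default_factory=dict)         # e.g. {"category": "RCA"}

regression_suite = Benchmark(
    name="RCA Scenarios",
    test_case_ids=["tc-timeout-001", "tc-oom-002"],
    labels={"category": "RCA"},
)
```

Because benchmarks reference test cases rather than embedding them, the same test case can appear in several suites (say, a regression suite and a difficulty-tiered suite) without duplication.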
Traces and Spans
Traces are the foundation of observability. A trace represents a complete execution of your agent, from the initial prompt to the final response.
Note: OpenSearch AgentHealth is built on OpenTelemetry, making it compatible with any OTEL-instrumented agent framework.
Span Categories
Each span in a trace is automatically categorized:
| Category | Description |
|---|---|
| AGENT | Root agent operation |
| LLM | Language model inference |
| TOOL | Tool/function invocation |
| RETRIEVAL | Vector/data retrieval |
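One way such categorization can work is by inspecting span attributes. The heuristic below is a sketch of the idea, not AgentHealth's actual logic; the `gen_ai.operation.name` values come from the OpenTelemetry GenAI semantic conventions.

```python
def categorize_span(attributes: dict) -> str:
    """Map OTEL GenAI span attributes to a coarse category (heuristic sketch)."""
    op = attributes.get("gen_ai.operation.name", "")
    if op in ("invoke_agent", "create_agent"):
        return "AGENT"
    if op in ("chat", "text_completion", "generate_content"):
        return "LLM"
    if op == "execute_tool":
        return "TOOL"
    # Spans that touch a data store but aren't GenAI operations:
    if attributes.get("db.system"):
        return "RETRIEVAL"
    return "UNKNOWN"
```

For example, `categorize_span({"gen_ai.operation.name": "chat"})` yields `LLM`, while a span carrying only `db.system` falls through to `RETRIEVAL`.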
Key Attributes
OpenTelemetry GenAI semantic conventions capture:
- Operation name (e.g., "invoke_agent")
- System (e.g., "claude", "bedrock")
- Model ID
- Tool being invoked
- Input/output token counts
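Concretely, these show up as key-value attributes on each span. The attribute names below follow the OTEL GenAI semantic conventions at the time of writing (check the current spec); the values are purely illustrative.

```python
# Example attribute set for an LLM inference span (illustrative values):
llm_span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.system": "bedrock",
    "gen_ai.request.model": "example-model-id",
    "gen_ai.usage.input_tokens": 412,
    "gen_ai.usage.output_tokens": 187,
}

# Example attribute set for a tool-invocation span:
tool_span_attributes = {
    "gen_ai.operation.name": "execute_tool",
    "gen_ai.tool.name": "query_metrics",
}
```

Because these names are standardized, any OTEL-instrumented framework emitting them can be analyzed the same way.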
Trajectories
A trajectory is the sequence of steps an agent takes to complete a task. Unlike simple output comparison, trajectory analysis captures:
- Thoughts: The agent's reasoning process
- Actions: Decisions to use tools or generate responses
- Tool Calls: Which tools were invoked and with what parameters
- Results: Outcomes from each tool invocation
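A trajectory can be modeled as an ordered list of steps, each carrying those four elements. The field names below are assumptions for illustration, not AgentHealth's stored format.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class TrajectoryStep:
    """One step in an agent trajectory (field names are illustrative)."""
    thought: str                          # the agent's reasoning at this step
    action: str                           # e.g. "tool_call" or "respond"
    tool_name: Optional[str] = None       # which tool was invoked, if any
    tool_args: Optional[dict] = None      # parameters passed to the tool
    result: Any = None                    # outcome of the tool invocation

trajectory = [
    TrajectoryStep(
        thought="Errors mention pool exhaustion; check DB latency first.",
        action="tool_call",
        tool_name="query_metrics",
        tool_args={"metric": "db.latency"},
        result={"p99_ms": 2400},
    ),
    TrajectoryStep(
        thought="Latency confirms a DB bottleneck; report the root cause.",
        action="respond",
    ),
]
```

Keeping thoughts and tool results side by side is what lets an evaluator check not just the final answer but whether each decision was justified by the evidence available at that step.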
Why Trajectory Matters
Most evaluation tools only check if the agent got the "right answer." But two agents can reach the same conclusion through very different paths:
Agent A (Optimal)
- Analyze error log
- Query metrics database
- Identify root cause
- Recommend fix
Agent B (Inefficient)
- Query 5 different databases
- Generate speculative hypotheses
- Make redundant API calls
- Eventually reach same conclusion
Trajectory evaluation reveals these differences, helping you optimize agent behavior.
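Even a crude metric over trajectories makes the difference measurable. A minimal sketch, assuming each step is a dict with an `action` key (a hypothetical shape):

```python
def tool_call_count(trajectory: list) -> int:
    """Count tool invocations in a trajectory (steps are illustrative dicts)."""
    return sum(1 for step in trajectory if step.get("action") == "tool_call")

# Agent A: analyze, query once, conclude.
agent_a = [
    {"action": "tool_call", "tool": "search_logs"},
    {"action": "tool_call", "tool": "query_metrics"},
    {"action": "respond"},
]

# Agent B: same final answer, but many redundant queries along the way.
agent_b = [{"action": "tool_call", "tool": f"db_{i}"} for i in range(7)]
agent_b.append({"action": "respond"})
```

Comparing `tool_call_count(agent_a)` against `tool_call_count(agent_b)` (2 vs. 7 here) surfaces the inefficiency that output-only evaluation would miss.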
AI Judge Evaluation
The AI Judge is an AI-powered evaluator that scores agent performance:
Evaluation Metrics
| Metric | Range | Description |
|---|---|---|
| Accuracy | 0-100 | Did the agent reach the correct conclusion? |
| Faithfulness | 0-100 | Did responses stay grounded in provided context? |
| Trajectory Alignment | 0-100 | Did steps match expected flow? |
| Efficiency | Qualitative | Were tool calls and reasoning optimal? |
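If you need a single number for dashboards or regression gates, the numeric metrics can be combined into a weighted average. AgentHealth does not prescribe this; the function and weights below are arbitrary assumptions for illustration.

```python
def composite_score(scores: dict, weights: dict = None) -> float:
    """Weighted average of the 0-100 judge metrics (weights are illustrative)."""
    weights = weights or {
        "accuracy": 0.4,
        "faithfulness": 0.3,
        "trajectory_alignment": 0.3,
    }
    return sum(scores[metric] * w for metric, w in weights.items())

run = {"accuracy": 90, "faithfulness": 80, "trajectory_alignment": 70}
# 90*0.4 + 80*0.3 + 70*0.3 = 81
```

Efficiency is qualitative in the table above, so it is deliberately left out of the numeric roll-up; treat it as a separate signal.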
Improvement Strategies
The judge provides categorized recommendations across:
- Tool usage - Optimizing how tools are called
- Reasoning - Improving decision-making logic
- Context - Better use of available information
- Efficiency - Reducing redundant operations
Labels and Organization
Use labels to organize test cases and benchmarks with key-value pairs:
- category:RCA - Root cause analysis
- difficulty:Easy|Medium|Hard
- domain:kubernetes|databases|networking
- framework:langgraph|strands|custom
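Key-value labels make selection straightforward: an item matches when every pair in the selector is present. A sketch of that matching logic (the helper name is hypothetical):

```python
def match_labels(item_labels: dict, selector: dict) -> bool:
    """True when every key-value pair in the selector is present on the item."""
    return all(item_labels.get(k) == v for k, v in selector.items())

cases = [
    {"id": "tc-1", "labels": {"category": "RCA", "difficulty": "Hard"}},
    {"id": "tc-2", "labels": {"category": "RCA", "difficulty": "Easy"}},
    {"id": "tc-3", "labels": {"category": "QueryGen", "difficulty": "Hard"}},
]

# Select only the hard RCA scenarios:
hard_rca = [
    c for c in cases
    if match_labels(c["labels"], {"category": "RCA", "difficulty": "Hard"})
]
```

The same selector logic applies whether you are filtering test cases into a benchmark or slicing results during analysis.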
Agent Adapters
AgentHealth supports multiple agent frameworks through adapters:
| Adapter | Protocol | Use Case |
|---|---|---|
| AG-UI | SSE Streaming | Real-time trajectory capture |
| Claude Code | CLI + JSON Events | Claude Code CLI agents |
| LangGraph | REST API | LangGraph deployments |
| Strands | AWS Bedrock Runtime | Strands framework |
| Custom | User-defined | Any agent implementation |
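A custom adapter ultimately has to do one thing: accept a prompt plus context and return the agent's response and trajectory. The contract below is a hypothetical sketch of that shape, not AgentHealth's actual adapter API.

```python
from typing import Any, Protocol

class AgentAdapter(Protocol):
    """Minimal adapter contract (hypothetical; the real interface may differ)."""

    def invoke(self, prompt: str, context: list) -> dict:
        """Run the agent and return its final response plus trajectory."""
        ...

class EchoAdapter:
    """Trivial custom adapter used only to illustrate the shape."""

    def invoke(self, prompt: str, context: list) -> dict:
        # A real adapter would call your agent framework here and
        # translate its events into trajectory steps.
        return {"response": f"echo: {prompt}", "trajectory": []}
```

Using a structural `Protocol` rather than a base class means any object with a matching `invoke` method qualifies, which suits the "any agent implementation" row in the table above.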
Data Flow
Understanding how data flows through AgentHealth:
- Test Case Created → Stored in OpenSearch
- Benchmark Defined → References test cases
- Run Executed → Agent invoked, trajectory captured, OTEL spans collected, AI Judge evaluates
- Results Stored → Run results indexed
- Analysis → Compare runs, track metrics, generate insights
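The "Run Executed" and "Results Stored" steps above can be sketched end to end. Every name here (the function, the stub adapter, the judge callable, the list standing in for an OpenSearch index) is hypothetical, assuming the adapter contract described earlier.

```python
def execute_run(adapter, test_case: dict, judge, store: list) -> dict:
    """One benchmark run step: invoke, evaluate, store (names hypothetical)."""
    result = adapter.invoke(test_case["initial_prompt"], test_case["context"])
    evaluation = judge(result["trajectory"], test_case["expected_outcomes"])
    record = {
        "test_case": test_case["id"],
        "result": result,
        "evaluation": evaluation,
    }
    store.append(record)  # stands in for indexing the record into OpenSearch
    return record

# Toy stand-ins so the flow can be exercised end to end:
class StubAdapter:
    def invoke(self, prompt, context):
        return {"response": "pool exhausted", "trajectory": ["query_metrics"]}

def stub_judge(trajectory, expected_outcomes):
    return {"accuracy": 100 if trajectory else 0}

store = []
case = {
    "id": "tc-1",
    "initial_prompt": "Why timeouts?",
    "context": [],
    "expected_outcomes": ["identify timeout"],
}
record = execute_run(StubAdapter(), case, stub_judge, store)
```

Once records accumulate in the store, the Analysis step reduces to querying and comparing them across runs.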