# OpenTelemetry Setup
OpenSearch AgentHealth is built on OpenTelemetry (OTEL), the industry standard for observability. This guide covers setting up OTEL instrumentation for your AI agents.
## Why OpenTelemetry?

### Vendor Neutral

No lock-in. Export to OpenSearch, Jaeger, Prometheus, or any OTEL-compatible backend.
### Industry Standard

GenAI Semantic Conventions provide standardized attributes for LLM operations.
### Rich Ecosystem

Auto-instrumentation libraries for major frameworks and languages.
### Future Proof

CNCF graduated project with broad industry adoption.
## Architecture Overview

```
┌────────────────────────────────────────────────────────────┐
│                       Your AI Agent                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │  LLM Calls   │  │  Tool Calls  │  │  Retrievals  │      │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘      │
│         └─────────────────┼─────────────────┘              │
│                  ┌────────┴────────┐                       │
│                  │    OTEL SDK     │                       │
│                  │ (Tracer, Meter) │                       │
│                  └────────┬────────┘                       │
└───────────────────────────┼────────────────────────────────┘
                            │ OTLP (gRPC:4317 / HTTP:4318)
                            ▼
                 ┌──────────────────────┐
                 │    OTEL Collector    │
                 │ • Memory Limiter     │
                 │ • Batch Processor    │
                 │ • Attribute Filter   │
                 └──────────┬───────────┘
            ┌───────────────┼───────────────┐
            ▼               ▼               ▼
      ┌──────────┐    ┌──────────┐    ┌──────────┐
      │OpenSearch│    │Prometheus│    │  Jaeger  │
      │ (Traces) │    │ (Metrics)│    │(Optional)│
      └──────────┘    └──────────┘    └──────────┘
```

## GenAI Semantic Conventions
AgentHealth follows the OpenTelemetry GenAI Semantic Conventions for consistent observability:
### Required Attributes

| Attribute | Description | Example |
|---|---|---|
| `gen_ai.system` | The AI system/provider | `openai`, `anthropic`, `bedrock` |
| `gen_ai.operation.name` | Operation being performed | `chat`, `completion`, `embedding` |
| `gen_ai.request.model` | Model identifier | `claude-sonnet-4`, `gpt-4o` |
### Recommended Attributes

| Attribute | Description | Example |
|---|---|---|
| `gen_ai.usage.input_tokens` | Input token count | `1500` |
| `gen_ai.usage.output_tokens` | Output token count | `500` |
| `gen_ai.tool.name` | Tool being invoked | `fetch_logs`, `query_metrics` |
| `gen_ai.tool.description` | Tool description | `Fetches application logs` |
### Example Span with Attributes

```python
from opentelemetry import trace
from opentelemetry.semconv.ai import GenAiAttributes

tracer = trace.get_tracer("my-agent")

with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute(GenAiAttributes.GEN_AI_SYSTEM, "anthropic")
    span.set_attribute(GenAiAttributes.GEN_AI_OPERATION_NAME, "chat")
    span.set_attribute(GenAiAttributes.GEN_AI_REQUEST_MODEL, "claude-sonnet-4")

    # Make LLM call
    response = client.messages.create(...)

    # Record token usage
    span.set_attribute(GenAiAttributes.GEN_AI_USAGE_INPUT_TOKENS, response.usage.input_tokens)
    span.set_attribute(GenAiAttributes.GEN_AI_USAGE_OUTPUT_TOKENS, response.usage.output_tokens)
```

## Quick Setup
### Python

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Configure resource
resource = Resource.create({
    "service.name": "my-agent",
    "service.version": "1.0.0",
    "deployment.environment": "production"
})

# Configure tracer provider
provider = TracerProvider(resource=resource)

# Add OTLP exporter
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",  # gRPC endpoint
    insecure=True
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

# Set global tracer provider
trace.set_tracer_provider(provider)

# Get tracer
tracer = trace.get_tracer("my-agent")
```

### JavaScript
```javascript
import { trace } from '@opentelemetry/api';
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-agent',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4317',
  }),
});

sdk.start();

// Get tracer
const tracer = trace.getTracer('my-agent');
```

## OTEL Collector Configuration
The OTEL Collector routes telemetry data to backends:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

  batch:
    send_batch_size: 1024
    timeout: 10s

  attributes:
    actions:
      # Promote GenAI attributes for better queryability
      - key: gen_ai.system
        action: insert
        from_attribute: gen_ai.system
      - key: gen_ai.request.model
        action: insert
        from_attribute: gen_ai.request.model

exporters:
  # OpenSearch for traces
  opensearch:
    http:
      endpoint: http://opensearch:9200
    logs_index: otel-logs
    traces_index: otel-traces

  # Prometheus for metrics
  prometheus:
    endpoint: 0.0.0.0:8889

  # Debug logging (development only)
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [opensearch, debug]

    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [opensearch]
```

## Instrumenting AI Agents

### Manual Instrumentation
For custom agents, manually create spans:
```python
from opentelemetry import trace

tracer = trace.get_tracer("my-agent")

def run_agent(prompt: str):
    with tracer.start_as_current_span("agent_run") as root_span:
        root_span.set_attribute("gen_ai.operation.name", "invoke_agent")

        # LLM reasoning
        with tracer.start_as_current_span("llm_reasoning") as llm_span:
            llm_span.set_attribute("gen_ai.system", "anthropic")
            llm_span.set_attribute("gen_ai.request.model", "claude-sonnet-4")

            response = llm_client.generate(prompt)

            llm_span.set_attribute("gen_ai.usage.input_tokens", response.input_tokens)
            llm_span.set_attribute("gen_ai.usage.output_tokens", response.output_tokens)

        # Tool call
        with tracer.start_as_current_span("tool_call") as tool_span:
            tool_span.set_attribute("gen_ai.tool.name", "fetch_logs")
            tool_span.set_attribute("gen_ai.tool.description", "Fetches application logs")

            logs = fetch_logs(service="api")

            tool_span.set_attribute("tool.result.size", len(logs))

        return response
```

### Auto-Instrumentation
Use auto-instrumentation for supported frameworks:
```python
# Install auto-instrumentation packages:
#   pip install opentelemetry-instrumentation-openai
#   pip install opentelemetry-instrumentation-anthropic

import anthropic
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor

# Enable auto-instrumentation
OpenAIInstrumentor().instrument()
AnthropicInstrumentor().instrument()

# Now all OpenAI/Anthropic calls are automatically traced
client = anthropic.Anthropic()
response = client.messages.create(...)  # Automatically instrumented
```

## Framework-Specific Setup

### HolmesGPT Integration
HolmesGPT natively supports OTEL instrumentation:
```python
from holmesgpt import HolmesGPT
from holmesgpt.telemetry import configure_telemetry

# Configure OTEL for HolmesGPT
configure_telemetry(
    endpoint="http://localhost:4317",
    service_name="holmesgpt-agent"
)

# HolmesGPT operations are now traced
holmes = HolmesGPT()
result = holmes.investigate(issue)
```

### LangGraph Integration
```python
from langgraph.prebuilt import create_react_agent
from opensearch_agentops.integrations import instrument_langgraph

# Enable OTEL instrumentation for LangGraph
instrument_langgraph()

# All LangGraph operations are traced
agent = create_react_agent(model, tools)
result = agent.invoke({"messages": [("user", prompt)]})
```

### Strands Integration
```python
from strands import Agent
from strands.telemetry import StrandsTelemetry

# Configure Strands telemetry
StrandsTelemetry.configure(
    endpoint="http://localhost:4317",
    service_name="strands-agent"
)

# Strands operations are automatically traced
agent = Agent(model="claude-sonnet-4")
result = agent.run(prompt)
```

## Best Practices
- Use semantic conventions: Follow GenAI conventions for consistent data
- Capture token usage: Essential for cost tracking
- Set meaningful names: Descriptive span names aid debugging
- Include context: Add relevant attributes for filtering
### Don'ts

- Don't log sensitive data: Avoid logging prompts with PII
- Don't over-instrument: Focus on meaningful operations
- Don't skip error handling: Record exceptions in spans
- Don't hardcode endpoints: Use environment variables
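As a sketch of the last point: the OTEL specification defines the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable for exactly this purpose. The helper below is our own illustrative name, not an SDK function; it resolves the endpoint at runtime with a local fallback:

```python
import os

def otlp_endpoint(default: str = "http://localhost:4317") -> str:
    """Resolve the OTLP endpoint from the spec-defined
    OTEL_EXPORTER_OTLP_ENDPOINT variable, falling back to a default."""
    return os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", default)

# Pass the resolved value to the exporter instead of a literal, e.g.:
# OTLPSpanExporter(endpoint=otlp_endpoint(), insecure=True)
```

This keeps the same code deployable across environments: point staging and production at different collectors by changing only the environment, not the source.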
## Troubleshooting

### No Traces Appearing

- Check the collector is running: `curl http://localhost:4318` (use the OTLP/HTTP port; the gRPC port 4317 will not answer a plain HTTP request)
- Verify exporter endpoint configuration
- Check for batch processor timeout (default 10s)
- Enable debug exporter to see outgoing data
### Missing Attributes

- Verify semantic convention attribute names
- Check attribute is set before span ends
- Ensure span processor is added to provider
### High Memory Usage

- Configure memory limiter in collector
- Reduce batch size
- Implement sampling for high-volume services
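For the sampling point above, one collector-side option is the `probabilistic_sampler` processor, which ships in the OpenTelemetry Collector contrib distribution. A minimal sketch, with an illustrative 10% rate, wired into a traces pipeline like the one configured earlier:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # keep ~10% of traces

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [opensearch]
```

Sampling in the collector keeps agent-side instrumentation unchanged; alternatively, head sampling can be configured in the SDK if you want to avoid sending the dropped spans over the network at all.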
## Next Steps

- Collector Configuration
- Auto-Instrumentation: Framework auto-instrumentation