# OpenTelemetry Setup
OpenSearch AgentHealth is built on OpenTelemetry (OTEL), the industry standard for observability. This guide covers setting up OTEL instrumentation for your AI agents.
## Why OpenTelemetry?

### Vendor Neutral

No lock-in. Export to OpenSearch, Jaeger, Prometheus, or any OTEL-compatible backend.
### Industry Standard

GenAI Semantic Conventions provide standardized attributes for LLM operations.
### Rich Ecosystem

Auto-instrumentation libraries for major frameworks and languages.
### Future Proof

CNCF graduated project with broad industry adoption.
## Architecture Overview

```
┌────────────────────────────────────────────────────────────┐
│                       Your AI Agent                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │  LLM Calls   │  │  Tool Calls  │  │  Retrievals  │      │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘      │
│         └─────────────────┼─────────────────┘              │
│                  ┌────────┴────────┐                       │
│                  │    OTEL SDK     │                       │
│                  │ (Tracer, Meter) │                       │
│                  └────────┬────────┘                       │
└───────────────────────────┼────────────────────────────────┘
                            │ OTLP (gRPC:4317 / HTTP:4318)
                            ▼
                 ┌──────────────────────┐
                 │    OTEL Collector    │
                 │ • Memory Limiter     │
                 │ • Batch Processor    │
                 │ • Attribute Filter   │
                 └──────────┬───────────┘
            ┌───────────────┼───────────────┐
            ▼               ▼               ▼
      ┌──────────┐    ┌──────────┐    ┌──────────┐
      │OpenSearch│    │Prometheus│    │  Jaeger  │
      │ (Traces) │    │ (Metrics)│    │(Optional)│
      └──────────┘    └──────────┘    └──────────┘
```

## GenAI Semantic Conventions
AgentHealth follows the OpenTelemetry GenAI Semantic Conventions for consistent observability:
### Required Attributes

| Attribute | Description | Example |
|---|---|---|
| `gen_ai.system` | The AI system/provider | `openai`, `anthropic`, `bedrock` |
| `gen_ai.operation.name` | Operation being performed | `chat`, `completion`, `embedding` |
| `gen_ai.request.model` | Model identifier | `claude-sonnet-4`, `gpt-4o` |
### Recommended Attributes

| Attribute | Description | Example |
|---|---|---|
| `gen_ai.usage.input_tokens` | Input token count | `1500` |
| `gen_ai.usage.output_tokens` | Output token count | `500` |
| `gen_ai.tool.name` | Tool being invoked | `fetch_logs`, `query_metrics` |
| `gen_ai.tool.description` | Tool description | `Fetches application logs` |
### Example Span with Attributes

```python
from opentelemetry import trace
from opentelemetry.semconv.ai import GenAiAttributes

tracer = trace.get_tracer("my-agent")

with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute(GenAiAttributes.GEN_AI_SYSTEM, "anthropic")
    span.set_attribute(GenAiAttributes.GEN_AI_OPERATION_NAME, "chat")
    span.set_attribute(GenAiAttributes.GEN_AI_REQUEST_MODEL, "claude-sonnet-4")

    # Make LLM call
    response = client.messages.create(...)

    # Record token usage
    span.set_attribute(GenAiAttributes.GEN_AI_USAGE_INPUT_TOKENS, response.usage.input_tokens)
    span.set_attribute(GenAiAttributes.GEN_AI_USAGE_OUTPUT_TOKENS, response.usage.output_tokens)
```

## Quick Setup
### Python

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Configure resource
resource = Resource.create({
    "service.name": "my-agent",
    "service.version": "1.0.0",
    "deployment.environment": "production"
})

# Configure tracer provider
provider = TracerProvider(resource=resource)

# Add OTLP exporter
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",  # gRPC endpoint
    insecure=True
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

# Set global tracer provider
trace.set_tracer_provider(provider)

# Get tracer
tracer = trace.get_tracer("my-agent")
```

### JavaScript
```javascript
import { trace } from '@opentelemetry/api';
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-agent',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4317',
  }),
});

sdk.start();

// Get tracer
const tracer = trace.getTracer('my-agent');
```

## OTEL Collector Configuration
The OTEL Collector routes telemetry data to backends:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

  batch:
    send_batch_size: 1024
    timeout: 10s

  attributes:
    actions:
      # Promote GenAI attributes for better queryability
      - key: gen_ai.system
        action: insert
        from_attribute: gen_ai.system
      - key: gen_ai.request.model
        action: insert
        from_attribute: gen_ai.request.model

exporters:
  # OpenSearch for traces
  opensearch:
    http:
      endpoint: http://opensearch:9200
    logs_index: otel-logs
    traces_index: otel-traces

  # Prometheus for metrics
  prometheus:
    endpoint: 0.0.0.0:8889

  # Debug logging (development only)
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [opensearch, debug]

    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [opensearch]
```

## Instrumenting AI Agents

### Manual Instrumentation
For custom agents, manually create spans:
```python
from opentelemetry import trace

tracer = trace.get_tracer("my-agent")

def run_agent(prompt: str):
    with tracer.start_as_current_span("agent_run") as root_span:
        root_span.set_attribute("gen_ai.operation.name", "invoke_agent")

        # LLM reasoning
        with tracer.start_as_current_span("llm_reasoning") as llm_span:
            llm_span.set_attribute("gen_ai.system", "anthropic")
            llm_span.set_attribute("gen_ai.request.model", "claude-sonnet-4")

            response = llm_client.generate(prompt)

            llm_span.set_attribute("gen_ai.usage.input_tokens", response.input_tokens)
            llm_span.set_attribute("gen_ai.usage.output_tokens", response.output_tokens)

        # Tool call
        with tracer.start_as_current_span("tool_call") as tool_span:
            tool_span.set_attribute("gen_ai.tool.name", "fetch_logs")
            tool_span.set_attribute("gen_ai.tool.description", "Fetches application logs")

            logs = fetch_logs(service="api")

            tool_span.set_attribute("tool.result.size", len(logs))

        return response
```

### Auto-Instrumentation
Use auto-instrumentation for supported frameworks:
```python
# Install auto-instrumentation packages:
#   pip install opentelemetry-instrumentation-openai
#   pip install opentelemetry-instrumentation-anthropic

import anthropic
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor

# Enable auto-instrumentation
OpenAIInstrumentor().instrument()
AnthropicInstrumentor().instrument()

# Now all OpenAI/Anthropic calls are automatically traced
client = anthropic.Anthropic()
response = client.messages.create(...)  # Automatically instrumented
```

## Framework-Specific Setup

### HolmesGPT Integration
HolmesGPT natively supports OTEL instrumentation:
```python
from holmesgpt import HolmesGPT
from holmesgpt.telemetry import configure_telemetry

# Configure OTEL for HolmesGPT
configure_telemetry(
    endpoint="http://localhost:4317",
    service_name="holmesgpt-agent"
)

# HolmesGPT operations are now traced
holmes = HolmesGPT()
result = holmes.investigate(issue)
```

### LangGraph Integration
```python
from langgraph.prebuilt import create_react_agent
from opensearch_agentops.integrations import instrument_langgraph

# Enable OTEL instrumentation for LangGraph
instrument_langgraph()

# All LangGraph operations are traced
agent = create_react_agent(model, tools)
result = agent.invoke({"messages": [("user", prompt)]})
```

### Strands Integration
```python
from strands import Agent
from strands.telemetry import StrandsTelemetry

# Configure Strands telemetry
StrandsTelemetry.configure(
    endpoint="http://localhost:4317",
    service_name="strands-agent"
)

# Strands operations are automatically traced
agent = Agent(model="claude-sonnet-4")
result = agent.run(prompt)
```

## Best Practices
- Use semantic conventions: Follow GenAI conventions for consistent data
- Capture token usage: Essential for cost tracking
- Set meaningful names: Descriptive span names aid debugging
- Include context: Add relevant attributes for filtering
### Don'ts

- Don't log sensitive data: Avoid logging prompts with PII
- Don't over-instrument: Focus on meaningful operations
- Don't skip error handling: Record exceptions in spans
- Don't hardcode endpoints: Use environment variables
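As a sketch of the last point: the OTEL specification defines the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable for exactly this purpose. The helper below is our own illustrative name, not an SDK function; it resolves the endpoint at runtime with a local fallback:

```python
import os

def otlp_endpoint(default: str = "http://localhost:4317") -> str:
    """Resolve the OTLP endpoint from the spec-defined
    OTEL_EXPORTER_OTLP_ENDPOINT variable, falling back to a default."""
    return os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", default)

# Pass the resolved value to the exporter instead of a literal, e.g.:
# OTLPSpanExporter(endpoint=otlp_endpoint(), insecure=True)
```

This keeps the same code deployable across environments: point staging and production at different collectors by changing only the environment, not the source.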
## Troubleshooting

### No Traces Appearing

- Check the collector is running: `curl http://localhost:4318` (use the OTLP/HTTP port; the gRPC port 4317 will not answer a plain HTTP request)
- Verify exporter endpoint configuration
- Check for batch processor timeout (default 10s)
- Enable debug exporter to see outgoing data
### Missing Attributes

- Verify semantic convention attribute names
- Check attribute is set before span ends
- Ensure span processor is added to provider
### High Memory Usage

- Configure memory limiter in collector
- Reduce batch size
- Implement sampling for high-volume services
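For the sampling point above, one collector-side option is the `probabilistic_sampler` processor, which ships in the OpenTelemetry Collector contrib distribution. A minimal sketch, with an illustrative 10% rate, wired into a traces pipeline like the one configured earlier:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # keep ~10% of traces

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [opensearch]
```

Sampling in the collector keeps agent-side instrumentation unchanged; alternatively, head sampling can be configured in the SDK if you want to avoid sending the dropped spans over the network at all.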
## Next Steps

- Collector Configuration
- Auto-Instrumentation: Framework auto-instrumentation