feat(otel): instrument runtime with GenAI semantic conventions #2620
tdabasinskas wants to merge 14 commits into
fa4a01d to 2a69313
@tdabasinskas not sure why, GitHub doesn't want to merge this one, because of hypothetical merge conflicts. Could you rebase?
2a69313 to 9b08feb
Done!
e7194da to b6a181b
/review
I don't think that worked 😅
/review
aheritier left a comment
LGTM. Clean design, solid thread safety, good spec adherence. The inline comments are all non-blocking suggestions for follow-up.
❌ PR Review Failed: The review agent encountered an error and could not complete the review. View logs.
@tdabasinskas can you rebase one more time and I'll review it?
Done! |
aheritier left a comment
Re-approving — my prior approval was dismissed by the merge of upstream/main into the branch, but there are zero new author code changes since a4ce95e8. All three of my previous comments were addressed and the threads are resolved. CI is green on the merge commit.
Original assessment stands: clean design, solid thread safety, good GenAI semconv adherence. LGTM.
Every toolset goes through tools.WithName in the team-loader
registry, which sandwiches a *tools.namedToolSet between the
StartableToolSet and the actual implementation. %T on the
embedded ToolSet therefore always reported *tools.namedToolSet
regardless of whether the inner toolset was MCP, A2A, a builtin,
or anything else - so the attribute could never answer the
question it exists to answer ("which kind of toolset is starting
right now?").
Unwrap once before formatting, mirroring what DescribeToolSet
already does for the same reason. Now the attribute reads
*mcp.Toolset, *builtin.ShellTool, etc., so a toolset.start
without HTTP children is immediately distinguishable from a
remote MCP whose POSTs are missing for some other reason.
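
The unwrap-before-format idea above can be sketched as follows. This is a minimal illustration, not the PR's code: `ToolSet`, `namedToolSet`, `mcpToolset`, the `Unwrap` method and the `toolsetType` helper are all stand-ins invented for the example.

```go
package main

import "fmt"

// ToolSet is a stand-in for the real toolset interface (assumption).
type ToolSet interface{ Name() string }

// namedToolSet mimics a naming wrapper that sits in front of another ToolSet.
type namedToolSet struct {
	ToolSet
	name string
}

func (n *namedToolSet) Name() string    { return n.name }
func (n *namedToolSet) Unwrap() ToolSet { return n.ToolSet }

// mcpToolset is a hypothetical concrete implementation.
type mcpToolset struct{}

func (*mcpToolset) Name() string { return "mcp" }

// toolsetType peels off one wrapper layer, if present, before formatting
// with %T, so the result names the inner implementation.
func toolsetType(ts ToolSet) string {
	if u, ok := ts.(interface{ Unwrap() ToolSet }); ok {
		ts = u.Unwrap()
	}
	return fmt.Sprintf("%T", ts)
}

func main() {
	wrapped := &namedToolSet{ToolSet: &mcpToolset{}, name: "github"}
	fmt.Printf("%T\n", wrapped)       // the wrapper type, whatever is inside
	fmt.Println(toolsetType(wrapped)) // the inner implementation type
}
```

Unwrapping exactly once matches the commit message; a doubly wrapped toolset would still report the wrapper type in this sketch.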
Record tool counts at two key points in the execution flow:
- Session span: total tools available after exclusion filters
- MCP list span: tools successfully yielded by each server

These attributes enable quick analysis of tool availability without inspecting nested spans or JSON-RPC payloads. The MCP count preserves partial results when iteration terminates early.
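
One way to keep a partial count when iteration stops early is to record it in a `defer`. The sketch below assumes a push-style iterator; `toolSeq`, `countingSeq`, `listTools` and the `record` callback are hypothetical names, and in real instrumentation `record` would presumably set a span attribute rather than assign to a variable.

```go
package main

import (
	"errors"
	"fmt"
)

// toolSeq is a push-style iterator: it calls yield once per tool until
// yield returns false.
type toolSeq func(yield func(name string, err error) bool)

// countingSeq wraps seq and reports how many tools were yielded without
// error, even if the consumer stops early or the source fails midway.
func countingSeq(seq toolSeq, record func(count int)) toolSeq {
	return func(yield func(string, error) bool) {
		count := 0
		defer func() { record(count) }() // runs on every exit path
		seq(func(name string, err error) bool {
			if err != nil {
				yield(name, err) // surface the error, then stop
				return false
			}
			count++
			return yield(name, nil)
		})
	}
}

// listTools is a fake server listing: two tools, then a transport error.
func listTools(yield func(string, error) bool) {
	for _, n := range []string{"read", "write"} {
		if !yield(n, nil) {
			return
		}
	}
	yield("", errors.New("connection reset"))
}

func main() {
	var recorded int
	seq := countingSeq(listTools, func(n int) { recorded = n })
	seq(func(name string, err error) bool { return err == nil })
	fmt.Println("tools yielded:", recorded) // tools yielded: 2
}
```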
…errors

Introduce a `classifyByStatusCode` helper that probes for an HTTP status code via a `StatusCode() int` method before falling back to substring matching. This prevents false positives when error messages incidentally contain strings like "401", "403", or "429" in request IDs, byte counts, or status-line fragments. Providers that expose HTTP status codes through a structured interface now get classified from the structural signal, while text-only errors continue to use the existing heuristic.

Also add documentation clarifying that `getInstruments` binds to the global MeterProvider on first call via `sync.Once`, which affects test setup requirements.
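
A rough sketch of the structural probe with a substring fallback is shown below. Only the `classifyByStatusCode` name and the `StatusCode() int` method come from the commit message; the `apiError` type, the `classify` wrapper, the class names and the rule that a zero code counts as absent are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// statusCoder is the structural signal the helper probes for.
type statusCoder interface{ StatusCode() int }

// apiError is a hypothetical provider error that exposes its HTTP status.
type apiError struct {
	code int
	msg  string
}

func (e *apiError) Error() string   { return e.msg }
func (e *apiError) StatusCode() int { return e.code }

// classifyByStatusCode classifies from a structured status code anywhere in
// the error chain. ok is false when no usable code is exposed.
func classifyByStatusCode(err error) (class string, ok bool) {
	var sc statusCoder
	if !errors.As(err, &sc) || sc.StatusCode() == 0 {
		return "", false
	}
	switch code := sc.StatusCode(); {
	case code == 401 || code == 403:
		return "auth", true
	case code == 429:
		return "rate_limit", true
	case code >= 500:
		return "server", true
	}
	return "other", true // a known code is authoritative; skip text matching
}

// classify prefers the structural signal and only then falls back to the
// substring heuristic for text-only errors.
func classify(err error) string {
	if class, ok := classifyByStatusCode(err); ok {
		return class
	}
	msg := err.Error()
	switch {
	case strings.Contains(msg, "401"), strings.Contains(msg, "403"):
		return "auth"
	case strings.Contains(msg, "429"):
		return "rate_limit"
	}
	return "unknown"
}

func main() {
	structured := fmt.Errorf("chat: %w", &apiError{code: 500, msg: "upstream failed (request id 4290)"})
	fmt.Println(classify(structured))                       // server
	fmt.Println(classify(errors.New("HTTP 429 slow down"))) // rate_limit
}
```

The first call shows the false positive being avoided: the message contains "429" inside a request ID, but the structured code wins.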
b43ca96 to 79bc9eb
Adds end-to-end OpenTelemetry instrumentation following the GenAI semantic conventions:
- `chat/embeddings/rerank` CLIENT spans with `gen_ai.*` attributes and the `gen_ai.client.token.usage/operation.duration` histograms.
- … `runtime.session`, `runtime.stream`, `runtime.fallback`, `runtime.tool.call`, `runtime.run_skill`, `runtime.task_transfer`, `runtime.handoff`, `background_agent.run`).
- … `params._meta` propagation, plus OAuth flow spans.
- … `otelhttp` and marked as `invoke_agent`.
- … `docker exec`.
- … `service.*`, `host.*`, `process.*`, `os.type`)

This PR wires two opt-in env vars beyond the default OTel SDK ones:

- `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT`: capture prompts, responses, tool arguments and tool results as span attributes. Off by default (PII surface).
- `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental`: emit only the spec-defined `gen_ai.*` keys. Default is dual-emit (both `gen_ai.*` and the legacy `tool.name/agent/session.id` keys), so existing dashboards keep working alongside spec-aware tooling.

The diff is large: ~50 files, ~5k lines. It's split into 10 topical commits (telemetry primitives → SDK init → providers → runtime → hooks → MCP → A2A → servers/cold-start → memory/RAG → tool internals) so each commit is independently reviewable. Most of the volume is in the new `pkg/telemetry/genai/` and `pkg/telemetry/mcp/` packages, which are pure helpers; the surface-area changes elsewhere are 1-3 lines per call site.
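
For readers unfamiliar with the two switches, here is a hedged sketch of how such opt-ins could be read. The env var names and the `gen_ai_latest_experimental` value come from the description above; the helper names, the choice to accept only `true` for content capture, and the pairing of `gen_ai.tool.name` with the legacy `tool.name` key are assumptions, not the PR's implementation.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const (
	envCapture   = "OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"
	envStability = "OTEL_SEMCONV_STABILITY_OPT_IN"
	optLatest    = "gen_ai_latest_experimental"
)

// specOnly reports whether the comma-separated opt-in list contains the
// GenAI value, in which case only spec-defined keys should be emitted.
func specOnly(value string) bool {
	for _, v := range strings.Split(value, ",") {
		if strings.TrimSpace(v) == optLatest {
			return true
		}
	}
	return false
}

// captureContent treats only "true" (case-insensitive) as enabling capture.
// Assumption: the real parsing may accept other values.
func captureContent(value string) bool {
	return strings.EqualFold(strings.TrimSpace(value), "true")
}

// toolNameKeys returns the attribute keys to set for a tool name: the spec
// key alone, or the spec key plus the legacy key when dual-emitting.
func toolNameKeys(latestOnly bool) []string {
	if latestOnly {
		return []string{"gen_ai.tool.name"}
	}
	return []string{"gen_ai.tool.name", "tool.name"}
}

func main() {
	latest := specOnly(os.Getenv(envStability))
	fmt.Println("capture content:", captureContent(os.Getenv(envCapture)))
	fmt.Println("tool name keys:", toolNameKeys(latest))
}
```

Dual-emit as the default means an unset or unrelated `OTEL_SEMCONV_STABILITY_OPT_IN` leaves the legacy keys in place.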