refactor(telemetry): migrate stream_with_chunking telemetry to plugin hooks#1361
Open
ajbozarth wants to merge 3 commits into
Open
refactor(telemetry): migrate stream_with_chunking telemetry to plugin hooks#1361ajbozarth wants to merge 3 commits into
ajbozarth wants to merge 3 commits into
Conversation
… hooks (generative-computing#1290) Move the stream_with_chunking orchestration span, its span events, and its metrics off inline telemetry calls onto the STREAMING_START/EVENT/END plugin hooks, completing the Phase 2 tracing migration. STREAMING_START fires in the caller task and STREAMING_END from acomplete(), giving the streaming span same-task attach/detach and correct stream_with_chunking > chat nesting. Removes the four deprecated public tracing helpers (trace_application, set_span_attribute, set_span_error, set_span_status_error). Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
…ask detach The backend `chat` span opened by stream_with_chunking attaches in the caller task but finishes in the orchestration task that drains the MOT. Detaching its OTel context token from that second task failed (logged by OTel as "Failed to detach context"), and the failed detach left the generation span ambient, so a subsequent validation `chat` span nested under generation instead of being its sibling under stream_with_chunking. Add STREAMING_ORCHESTRATION_START/END hooks fired on the orchestration task. The tracing plugin uses them to re-attach the stream_with_chunking span as that task's ambient context for the drain/validate loop, so mid-stream spans parent correctly. _safe_detach skips a detach that would cross asyncio tasks: within a reattach scope the skip is expected and logged at debug; otherwise the detach runs so OTel surfaces its own error, preceded by a warning naming the mismatch. Add unit coverage for the reattach helpers, the task-identity classification, and both log paths; an integration test asserting both chat spans are siblings under stream_with_chunking and carry no streaming events; and an e2e streaming test. Document the stream_with_chunking span and its hierarchy. Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull Request
Issue
Fixes #1290
Description
Migrates
stream_with_chunkingtelemetry to plugin hooks as part of the Phase 2 migration: moves the orchestration span and its streaming metrics off directmellea.telemetrycalls and onto plugin hooks. With this change, no async path inmellea/stdlib/calls the oldtrace_application/set_span_*API, so the deprecated helpers can finally be removed.Span and metrics on hooks. New streaming hook types in
mellea/plugins/types.py(with payload dataclasses inmellea/plugins/hooks/streaming.py) carry the data previously stamped inline — chunk index, requirement counts, pass/fail, exceptions — keyed by astreaming_idso pre/post hooks pair cleanly across the_run_async_in_threadtasks. A newStreamingTracingPluginsubscribes to these hooks and emits the orchestration span via internal helpers inmellea/telemetry/tracing.py; the existing metrics plugins gain streaming subscriptions, replacing the inlinerecord_*calls. Metrics record fire-and-forget so they never block the stream. Spans, events, and metric names/attributes matchmain— no observability regression — and the span still nests correctly under the action span when called via a session.Cleanup. With streaming migrated, the four deprecated public tracing helpers (
trace_application,set_span_attribute,set_span_error,set_span_status_error) are removed; the internal setter logic moves to_tracing_setters.py.Bug fix surfaced during migration. The backend
chatspan attaches in the caller task but finishes in the orchestration task that drains the MOT, so its OTel context detach crossed asyncio tasks and failed — leaving the generation span ambient and mis-nesting validationchatspans under generation instead of as its sibling. The fix adds orchestration-task hooks that re-attach thestream_with_chunkingspan as ambient context for the drain/validate loop, and_safe_detachnow skips cross-task detaches within a reattach scope. See6c3de9e5for details.Follow-up bug fix.
functional.pynow catchesBaseException(not justException) in the component error path, soCancelledError/KeyboardInterruptfireCOMPONENT_POST_ERROR(closing the action span) before propagating. Caught while writing this PR; the handler re-raises, so nothing is swallowed.Docs.
docs/docs/observability/tracing.mddocuments thestream_with_chunkingspan (attributes + lifecycle events) and shows the twochatspans as siblings under it.Testing
Attribution
Adding a new component, requirement, sampling strategy, or tool?
If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.
NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.