Fix #1086: XE sessions self-heal and missing sessions are surfaced (Lite + Dashboard) #1089
Merged
…ite + Dashboard)

Lite:
- Move EnsureBlockedProcess/DeadlockXeSessionAsync from the tab-open-only path into RunCollectorAsync so the background loop creates/retries the sessions every cycle (cheap existence check once created)
- Ensure failures now throw XeSessionEnsureException instead of being swallowed, so a missing session can never be masked by a zero-row "successful" ring-buffer read
- Surface failures: CollectorHealthEntry.XeSessionUnavailable, status-bar "Capture down" state (covers PERMISSIONS failures that don't increment ConsecutiveErrors), and an edge-triggered tray notification
- Backport STARTUP_STATE = ON to the Azure SQL DB database-scoped DDL so sessions restart after failover

Dashboard (install/22 + install/24):
- Ensure (create/start) the XE session at the top of every collector run; server-scoped sessions dropped/stopped after install now self-heal
- Azure SQL DB: actually create the database-scoped sessions the comments claimed were "auto-created" (nothing created them; capture was 100% non-functional). Also fix the Azure deadlock read to filter on database_xml_deadlock_report instead of xml_deadlock_report
- Log SESSION_MISSING (with the real error) instead of SUCCESS when the session is absent and can't be created; ring-buffer read errors now re-raise instead of being swallowed
- New Capture Down / Capture Restored alert through the standard pipeline (snoozable tray + email + webhook + history + cooldown + mute), fed by AlertHealthResult.MissingCaptureSessions from the latest collection_log status

Tests: 5 new XeSessionHealthTests regression-lock the health bookkeeping; Lite 411/411, Dashboard 476/476. Live-validated on SQL2019: drop->recreate, stop->restart, idempotent re-run, low-priv login logs SESSION_MISSING, next privileged run self-heals.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes #1086.
Problem
The PerformanceMonitor_BlockedProcess / PerformanceMonitor_Deadlock Extended Events sessions could be absent while the blocking/deadlock collectors read the non-existent ring buffer, got zero rows, and reported SUCCESS: capture silently dead, no signal. Three variants, all confirmed:
- …SUCCESS/0 forever.
- …xml_deadlock_report instead of database_xml_deadlock_report, so it would have returned zero events even with a session present.
Fix
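The failure mode and the ensure-before-read ordering can be sketched as follows. This is a minimal, hedged illustration in Python (used only for illustration; it is not the project's language). `FakeServer`, `XeSessionEnsureError`, and both `collect_*` functions are hypothetical names, and returning a status string is a simplification of whatever the real collectors log or throw.

```python
from dataclasses import dataclass, field


class XeSessionEnsureError(Exception):
    """Hypothetical stand-in for the PR's XeSessionEnsureException."""


@dataclass
class FakeServer:
    """Toy in-memory server; a real collector would talk to SQL Server."""
    sessions: dict = field(default_factory=dict)  # session name -> buffered events
    can_create: bool = True                       # False models a low-privilege login

    def ensure_session(self, name: str) -> None:
        if name in self.sessions:
            return  # cheap existence check once the session exists
        if not self.can_create:
            raise XeSessionEnsureError(f"cannot create {name}: permission denied")
        self.sessions[name] = []

    def read_ring_buffer(self, name: str) -> list:
        # Reading a session that does not exist yields zero rows, not an error.
        return list(self.sessions.get(name, []))


def collect_without_ensure(server: FakeServer, name: str) -> str:
    """Old behaviour (assumed shape): a missing session looks like an idle one."""
    rows = server.read_ring_buffer(name)
    return f"SUCCESS/{len(rows)}"


def collect_with_ensure(server: FakeServer, name: str) -> str:
    """Ensure first, so a zero-row read can no longer mask a missing session."""
    try:
        server.ensure_session(name)
    except XeSessionEnsureError as exc:
        return f"SESSION_MISSING: {exc}"
    rows = server.read_ring_buffer(name)
    return f"SUCCESS/{len(rows)}"
```

With a `FakeServer(can_create=False)`, the first function still reports a zero-row success, while the second reports the missing session and heals it on the next run where creation is allowed.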
Lite
- …RunCollectorAsync (gated to the two XE collectors): both tab-open and background paths now create/start/retry every cycle; existence check is one cheap round-trip once healthy.
- …XeSessionEnsureException) instead of being swallowed, classified PERMISSIONS (229/297/300, keeps existing skip-until-restart semantics) or ERROR. Ensure runs before the read, so zero-rows-SUCCESS can never mask a missing session.
- …XeSessionUnavailable on the health entry, status-bar "Capture down" branch (PERMISSIONS failures don't increment ConsecutiveErrors and were otherwise invisible), edge-triggered tray balloon per server.
- …STARTUP_STATE = ON (MS-recommended; sessions restart after failover, which was Lite's "stops on reconnect" symptom).
Dashboard
- …BEGIN TRANSACTION, since XE DDL isn't allowed inside one; Azure-only catalog views accessed via dynamic SQL per the procs' existing pattern).
- …database_xml_deadlock_report for deadlocks) + the read-side event-name fix.
- …SESSION_MISSING with the real error, not SUCCESS. Ring-buffer read errors re-raise.
- …collection_log status through GetAlertHealthAsync; no new polling loop. Gated on the blocking/deadlock notify prefs.

No upgrades/ entry: proc-body changes only (idempotent install/*), and SESSION_MISSING (15 chars) fits the existing unconstrained nvarchar(20) column. No csproj version bump: dev bumps at release time and the 2.11.0→2.12.0 train is owned by parallel work.
Known limitation
On Azure SQL DB the blocked-process threshold can't be set via sp_configure and MS documents no default, so the blocked-process session may exist yet capture nothing there (deadlock capture has no such dependency). Verified against MS Learn; needs a live Azure SQL DB target to confirm runtime behavior. Called out in the proc header.
Test plan
- …XeSessionHealthTests regression-lock the silent-OK → surfaced → self-heal bookkeeping (incl. the PERMISSIONS-invisible case and stale-message clearing)
- …STARTUP_STATE = ON); stop session → next run restarts it; double-run is a no-op
- …VIEW SERVER STATE, no ALTER ANY EVENT SESSION) + absent session → logs SESSION_MISSING with "User does not have permission…"; next privileged run logs SUCCESS and heals the session
🤖 Generated with Claude Code
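The silent-OK → surfaced → self-heal bookkeeping named in the test plan, together with the edge-triggered notification, might look roughly like the sketch below. It is written in Python purely for illustration; `CaptureHealth`, its fields, and the notification strings are hypothetical and are not taken from the project's code.

```python
class CaptureHealth:
    """Hypothetical per-server health entry; names are illustrative only."""

    def __init__(self) -> None:
        self.unavailable = False  # loosely mirrors the idea of XeSessionUnavailable
        self.message = None       # last ensure error, cleared on recovery

    def record(self, ensure_error):
        """Record one collector cycle; return a notification only on a state change."""
        was_down = self.unavailable
        self.unavailable = ensure_error is not None
        self.message = ensure_error  # None on success, so stale text is cleared
        if self.unavailable and not was_down:
            return "Capture down"
        if was_down and not self.unavailable:
            return "Capture restored"
        return None  # steady state: no repeat notification


health = CaptureHealth()
events = [
    health.record(None),                              # healthy, nothing to report
    health.record("User does not have permission"),   # first failure: notify once
    health.record("User does not have permission"),   # still down: stay quiet
    health.record(None),                              # healed: notify recovery
]
```

Because the flag is driven by the ensure result rather than by an error counter, a failure class that never increments a counter (such as the PERMISSIONS case described in the PR) would still be surfaced in this sketch.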