Fix #1086: XE sessions self-heal and missing sessions are surfaced (Lite + Dashboard) #1089

Merged: erikdarlingdata merged 1 commit into dev from feature/1086-xe-session-self-heal on Jun 9, 2026

@erikdarlingdata (Owner)

Fixes #1086.

Problem

The PerformanceMonitor_BlockedProcess / PerformanceMonitor_Deadlock Extended Events sessions could be absent while the blocking/deadlock collectors read the non-existent ring buffer, got zero rows, and reported SUCCESS: capture was silently dead, with no signal. Three variants, all confirmed:

  1. Lite (the filed issue): sessions were created only on the tab-open path; the background loop never created or retried them. A failed/never-run first attempt stayed broken until a manual tab re-open.
  2. Dashboard server-scoped: sessions created once at install; if later stopped/dropped, procs 22/24 swallowed the error and logged SUCCESS/0 forever.
  3. Dashboard Azure SQL DB: comments claimed the database-scoped sessions were "auto-created by the collection procedures" — nothing created them anywhere. Capture was 100% non-functional on Azure SQL DB. Bonus latent bug: the Azure deadlock read filtered on xml_deadlock_report instead of database_xml_deadlock_report, so it would have returned zero events even with a session present.
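
The failure mode shared by the three variants can be illustrated with a small model. This is a sketch in Python, not the project's C# or T-SQL: `SessionNotFound`, `read_ring_buffer`, `collect_swallowing`, and the in-memory `sessions` dict are all hypothetical names invented for the illustration. It only shows why a swallowed read error makes a missing session indistinguishable from a quiet server.

```python
class SessionNotFound(Exception):
    """Hypothetical stand-in for the error raised when a session is absent."""


def read_ring_buffer(sessions, name):
    # `sessions` maps a session name to its captured events (model only).
    if name not in sessions:
        raise SessionNotFound(name)
    return list(sessions[name])


def collect_swallowing(sessions, name):
    # Pre-fix shape: the read error is swallowed, so the collector
    # reports SUCCESS with zero rows whether or not the session exists.
    try:
        rows = read_ring_buffer(sessions, name)
    except SessionNotFound:
        rows = []
    return ("SUCCESS", len(rows))


# A healthy but quiet session and a missing session produce the same log entry.
quiet = collect_swallowing({"PerformanceMonitor_Deadlock": []},
                           "PerformanceMonitor_Deadlock")
missing = collect_swallowing({}, "PerformanceMonitor_Deadlock")
```

Both calls return the same status tuple, which is the "no signal" condition described above.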

Fix

Lite

  • Session ensure moved into RunCollectorAsync (gated to the two XE collectors) — both tab-open and background paths now create/start/retry every cycle; existence check is one cheap round-trip once healthy.
  • Ensure failures throw (XeSessionEnsureException) instead of being swallowed, classified PERMISSIONS (229/297/300, keeps existing skip-until-restart semantics) or ERROR. Ensure runs before the read, so zero-rows-SUCCESS can never mask a missing session.
  • Surfacing: XeSessionUnavailable on the health entry, status-bar "Capture down" branch (PERMISSIONS failures don't increment ConsecutiveErrors and were otherwise invisible), edge-triggered tray balloon per server.
  • Azure DDL gains STARTUP_STATE = ON (MS-recommended; sessions restart after failover — this was Lite's "stops on reconnect" symptom).
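
The ensure-before-read ordering and the PERMISSIONS/ERROR classification can be sketched roughly as below. This is a Python model under stated assumptions, not the Lite implementation: `SqlError`, `XeSessionEnsureError`, `run_collector`, `outcome`, and the three fake ensure functions are hypothetical, and only the error numbers 229/297/300 come from the description above.

```python
PERMISSION_ERRORS = {229, 297, 300}  # error numbers named in the description


class SqlError(Exception):
    """Hypothetical driver error carrying a SQL Server error number."""
    def __init__(self, number, message):
        super().__init__(message)
        self.number = number


class XeSessionEnsureError(Exception):
    """Model of an ensure failure; category is PERMISSIONS or ERROR."""
    def __init__(self, number, message):
        super().__init__(message)
        self.category = "PERMISSIONS" if number in PERMISSION_ERRORS else "ERROR"


def run_collector(ensure_session, read_events):
    # Ensure first: if the session cannot be created or started, raise
    # before the read, so an empty read is never reported as success.
    try:
        ensure_session()
    except SqlError as exc:
        raise XeSessionEnsureError(exc.number, str(exc)) from exc
    return ("SUCCESS", len(read_events()))


def outcome(ensure_session, read_events):
    # Helper for the examples: the status a caller would record.
    try:
        return run_collector(ensure_session, read_events)[0]
    except XeSessionEnsureError as exc:
        return exc.category


def denied():
    # Illustrative message text; only the error number matters here.
    raise SqlError(297, "permission denied (illustrative)")


def broken():
    raise SqlError(50000, "hypothetical non-permission failure")


def healthy():
    return None
```

With these stand-ins, `outcome(denied, list)` yields the PERMISSIONS category, `outcome(broken, list)` yields ERROR, and `outcome(healthy, list)` yields SUCCESS for an empty read, because the read is reached only when the ensure step did not raise.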

Dashboard

  • Procs 22/24 ensure the session at the top of every run (before BEGIN TRANSACTION — XE DDL isn't allowed inside one; Azure-only catalog views accessed via dynamic SQL per the procs' existing pattern).
  • Azure SQL DB database-scoped sessions are now actually created (database_xml_deadlock_report for deadlocks) + the read-side event-name fix.
  • Honest logging: genuinely-absent-and-uncreatable session → SESSION_MISSING with the real error, not SUCCESS. Ring-buffer read errors re-raise.
  • New Capture Down / Capture Restored alert via the standard engine (snoozable tray + email + webhook + history + cooldown + mute), fed by the latest collection_log status through GetAlertHealthAsync — no new polling loop. Gated on the blocking/deadlock notify prefs.
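
One way a Capture Down / Capture Restored pair could be derived from the latest logged status is by detecting transitions, sketched below in Python. The function names `capture_alert` and `replay` are hypothetical, the sketch treats any status other than SESSION_MISSING as "not down" (an assumption), and it leaves out the cooldown, snooze, and mute handling that the standard alert engine applies.

```python
def capture_alert(was_down, latest_status):
    # Returns (is_down, alert); alert is None unless the state changed.
    is_down = latest_status == "SESSION_MISSING"
    if is_down and not was_down:
        return is_down, "Capture Down"
    if was_down and not is_down:
        return is_down, "Capture Restored"
    return is_down, None


def replay(statuses):
    # Feed a sequence of latest-status values through the transition check
    # and collect the alerts that would fire, in order.
    down, alerts = False, []
    for status in statuses:
        down, alert = capture_alert(down, status)
        if alert:
            alerts.append(alert)
    return alerts
```

Replaying `["SUCCESS", "SESSION_MISSING", "SESSION_MISSING", "SUCCESS"]` produces one Capture Down followed by one Capture Restored; the repeated SESSION_MISSING does not fire a second alert.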

No upgrades/ entry: proc-body changes only (idempotent install/*), and SESSION_MISSING (15 chars) fits the existing unconstrained nvarchar(20) column. No csproj version bump — dev bumps at release time and the 2.11.0→2.12.0 train is owned by parallel work.

Known limitation

On Azure SQL DB the blocked-process threshold can't be set via sp_configure and MS documents no default — the blocked-process session may exist yet capture nothing there (deadlock capture has no such dependency). Verified against MS Learn; needs a live Azure SQL DB target to confirm runtime behavior. Called out in the proc header.

Test plan

  • 5 new XeSessionHealthTests regression-lock the silent-OK → surfaced → self-heal bookkeeping (incl. the PERMISSIONS-invisible case and stale-message clearing)
  • Lite suite 411/411; Dashboard suite 476/476; both apps build clean
  • Live on SQL2019: procs deploy idempotently; drop both sessions → next run recreates them (running, STARTUP_STATE = ON); stop session → next run restarts it; double-run is a no-op
  • Live on SQL2019: low-priv login (VIEW SERVER STATE, no ALTER ANY EVENT SESSION) + absent session → logs SESSION_MISSING with "User does not have permission…"; next privileged run logs SUCCESS and heals the session
  • Not run: Azure SQL DB runtime verification (no target available)

🤖 Generated with Claude Code

…ite + Dashboard)

Lite:
- Move EnsureBlockedProcess/DeadlockXeSessionAsync from the tab-open-only
  path into RunCollectorAsync so the background loop creates/retries the
  sessions every cycle (cheap existence check once created)
- Ensure failures now throw XeSessionEnsureException instead of being
  swallowed, so a missing session can never be masked by a zero-row
  "successful" ring-buffer read
- Surface failures: CollectorHealthEntry.XeSessionUnavailable, status-bar
  "Capture down" state (covers PERMISSIONS failures that don't increment
  ConsecutiveErrors), and an edge-triggered tray notification
- Backport STARTUP_STATE = ON to the Azure SQL DB database-scoped DDL so
  sessions restart after failover

Dashboard (install/22 + install/24):
- Ensure (create/start) the XE session at the top of every collector run;
  server-scoped sessions dropped/stopped after install now self-heal
- Azure SQL DB: actually create the database-scoped sessions the comments
  claimed were "auto-created" (nothing created them; capture was 100%
  non-functional). Also fix the Azure deadlock read to filter on
  database_xml_deadlock_report instead of xml_deadlock_report
- Log SESSION_MISSING (with the real error) instead of SUCCESS when the
  session is absent and can't be created; ring-buffer read errors now
  re-raise instead of being swallowed
- New Capture Down / Capture Restored alert through the standard pipeline
  (snoozable tray + email + webhook + history + cooldown + mute), fed by
  AlertHealthResult.MissingCaptureSessions from the latest collection_log
  status

Tests: 5 new XeSessionHealthTests regression-lock the health bookkeeping;
Lite 411/411, Dashboard 476/476. Live-validated on SQL2019: drop->recreate,
stop->restart, idempotent re-run, low-priv login logs SESSION_MISSING,
next privileged run self-heals.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@erikdarlingdata erikdarlingdata merged commit f49027d into dev Jun 9, 2026
6 checks passed
@erikdarlingdata erikdarlingdata deleted the feature/1086-xe-session-self-heal branch June 9, 2026 22:50