feat(dsql): Add system diagnostics workflow (Workflow 12) #207

Open
Morlej wants to merge 4 commits into awslabs:main from Morlej:feat/dsql-system-diagnostics
Conversation

@Morlej Morlej commented Jun 29, 2026

Contributor

Adds system diagnostics based on CloudWatch average active sessions (AAS) metrics to the DSQL skill as Workflow 12.

What it does

Uses PromQL queries against db.active_sessions.avg OTel metrics to run a mandatory full diagnostic sweep across 5 phases:

  1. Discovery & Baseline Comparison — wait event distribution shifts (current vs yesterday vs last week)
  2. Top-SQL Regression Detection — queries that are new or growing in the top-N
  3. Workload Attribution — application/role changes
  4. Commit & OCC Analysis — distinguish volume growth from conflict growth via CW metrics
  5. Inflection Point Detection — pinpoint when the change occurred
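The Phase 1 comparison can be sketched as follows. This is a minimal illustration, not the procedure in workflow.md: it assumes AAS has already been summed per wait event for each time window, that a "shift" means a relative change in an event's share of total AAS, and that the cutoff is the 30% figure from the design decisions. The function names and the event labels in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple


def normalize(aas_by_event: Dict[str, float]) -> Dict[str, float]:
    """Convert absolute AAS per wait event into shares of the total."""
    total = sum(aas_by_event.values())
    if total <= 0:
        return {event: 0.0 for event in aas_by_event}
    return {event: value / total for event, value in aas_by_event.items()}


def flag_shifts(
    current: Dict[str, float],
    baseline: Dict[str, float],
    threshold: float = 0.30,  # assumption: mirrors the ">30% change" rule
) -> List[Tuple[str, float, float]]:
    """Return (event, baseline_share, current_share) for shifted events."""
    cur = normalize(current)
    base = normalize(baseline)
    flagged = []
    for event in sorted(set(cur) | set(base)):
        c = cur.get(event, 0.0)
        b = base.get(event, 0.0)
        if b == 0.0:
            # Event absent from the baseline: flag it if it appears now.
            if c > 0.0:
                flagged.append((event, b, c))
            continue
        if abs(c - b) / b > threshold:
            flagged.append((event, b, c))
    return flagged
```

Running `flag_shifts` once against yesterday's window and once against last week's gives the two baseline comparisons the phase describes. Because only shares are compared, a uniform increase in load across all events is not flagged.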

Routes identified queries to Workflow 9 (Query Plan Explainability) for per-query investigation.

OTel attribute naming

Uses the proposed naming convention:

  • db.wait.event, db.wait.class, db.session.state
  • db.query.id, db.query.normalized_text
  • aws.auroradsql.session.role.arn, application.name
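Because these attribute names contain dots, they cannot be written as bare PromQL label names. The sketch below builds a selector that quotes every label name and addresses the metric through `__name__`, following the syntax rules noted in the commit messages. The helper name is hypothetical, and whether the CloudWatch PromQL endpoint accepts exactly this quoting is an assumption based on Prometheus-style UTF-8 name support.

```python
from typing import Dict


def promql_selector(metric: str, labels: Dict[str, str]) -> str:
    """Build a selector with a __name__ matcher and quoted label names."""

    def esc(text: str) -> str:
        # Escape backslashes first, then double quotes, for PromQL strings.
        return text.replace("\\", "\\\\").replace('"', '\\"')

    parts = [f'__name__="{esc(metric)}"']
    for name, value in labels.items():
        # Quote the label name so dots and '@' are accepted.
        parts.append(f'"{esc(name)}"="{esc(value)}"')
    return "{" + ", ".join(parts) + "}"
```

For example, `promql_selector("db.active_sessions.avg", {"db.wait.class": "example"})` yields `{__name__="db.active_sessions.avg", "db.wait.class"="example"}`, where `example` is a placeholder value rather than a real wait class.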

Key design decisions

  • No absolute AAS thresholds — only relative distribution shifts (>30% change flagged)
  • A single slow query contributes at most 1 AAS, so high AAS indicates high concurrency or execution frequency rather than one slow statement
  • All per-query recommendations deferred to Workflow 9 (EXPLAIN analysis)
  • Agent MUST execute all phases — no stopping at first finding
  • Performance Routing table prevents bypass via other workflows
  • PromQL label discovery requires the match parameter; calls without it return empty results (documented as a critical rule)
  • Requires CloudWatch MCP server in the same region as the DSQL cluster
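The "max 1 AAS" point follows from how average active sessions is computed: total active session time divided by the length of the window. A minimal sketch, assuming each tuple is one session's active interval in seconds (the function name and input shape are illustrative, not part of the skill):

```python
from typing import Iterable, Tuple


def average_active_sessions(
    intervals: Iterable[Tuple[float, float]],
    window_start: float,
    window_end: float,
) -> float:
    """AAS = active session-seconds inside the window / window length."""
    window = window_end - window_start
    if window <= 0:
        raise ValueError("window_end must be greater than window_start")
    active_seconds = 0.0
    for start, end in intervals:
        # Count only the part of the interval that overlaps the window.
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            active_seconds += overlap
    return active_seconds / window
```

One session busy for an entire 60-second window gives `average_active_sessions([(0, 60)], 0, 60) == 1.0`, however slow its query is. Ten sessions each active for 30 seconds of the same window give 5.0, which is why high AAS points to concurrency or frequency.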

Files

  • references/system-diagnostics/workflow.md — 5-phase diagnostic procedure
  • references/system-diagnostics/wait-events.md — canonical DSQL wait event reference
  • references/system-diagnostics/promql-patterns.md — reusable PromQL templates
  • .mcp.json — adds cloudwatch MCP server (disabled by default)
  • SKILL.md — updated description, tags, reference table, Workflow 12, Performance Routing table
  • Version bump to 1.5.0

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

Morlej added 4 commits June 29, 2026 14:18
Add CloudWatch AAS-based system diagnostics to the DSQL skill. Uses
PromQL queries against db.active_sessions.avg to detect temporal anomalies
in wait event distribution and identify regressed queries, then routes to
Workflow 9 (Query Plan Explainability) for per-query investigation.
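Identifying regressed queries can be sketched as a comparison of per-query AAS between the current window and a baseline. This is an illustration under assumptions: the inputs are dicts keyed by `db.query.id`, "growing" means a relative AAS increase above a cutoff, and the 0.30 default simply reuses the 30% figure from the PR description. Function names are hypothetical.

```python
from typing import Dict, List


def top_n(aas_by_query: Dict[str, float], n: int) -> List[str]:
    """Query ids with the highest AAS, ties broken by id for stability."""
    ranked = sorted(aas_by_query.items(), key=lambda kv: (-kv[1], kv[0]))
    return [query_id for query_id, _ in ranked[:n]]


def detect_regressions(
    current: Dict[str, float],
    baseline: Dict[str, float],
    n: int = 10,
    growth: float = 0.30,  # assumption: same cutoff as distribution shifts
) -> Dict[str, str]:
    """Label each current top-N query as 'new' or 'growing' if it regressed."""
    findings: Dict[str, str] = {}
    for query_id in top_n(current, n):
        cur = current[query_id]
        base = baseline.get(query_id, 0.0)
        if base <= 0.0:
            findings[query_id] = "new"
        elif (cur - base) / base > growth:
            findings[query_id] = "growing"
    return findings
```

The resulting ids are the candidates to hand to Workflow 9 for per-query investigation.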

OTel attribute names use the proposed naming convention:
- db.wait.event, db.wait.class, db.session.state
- db.query.id, db.query.normalized_text
- aws.auroradsql.session.role.arn, application.name

New files:
- references/system-diagnostics/workflow.md — 6 diagnostic sub-workflows
- references/system-diagnostics/wait-events.md — canonical wait event reference
- references/system-diagnostics/promql-patterns.md — reusable PromQL templates

Also:
- Adds cloudwatch MCP server to .mcp.json (disabled by default)
- Bumps plugin version to 1.5.0

Add a decision table before Common Workflows that routes performance
complaints to Workflow 12 (System Diagnostics) instead of allowing
them to fall through to Workflow 9 (Query Plan Explainability) directly.

Rule: when in doubt, start with Workflow 12 — it identifies specific
queries and routes to Workflow 9 with context.

- Use correct get_promql_label_values syntax with match parameter
- Add note that calls without match filter return empty
- Add PromQL syntax rules: quote labels with dots/@, use __name__ selector
- Add explicit discovery step to Workflow 1
- Fix promql-patterns.md to show actual tool parameter names (label_name, match)

Replace separate numbered workflows (1-6) with a single diagnostic
procedure of 5 mandatory phases. The agent MUST execute ALL phases
before presenting results — no stopping at the first finding.

Phases:
1. Discovery and Baseline Comparison (distribution shifts)
2. Top-SQL Regression Detection (new/growing queries)
3. Workload Attribution (application/role changes)
4. Commit and OCC Analysis (volume vs conflicts)
5. Inflection Point Detection (when did it change)
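
One simple way to approximate Phase 5 is a mean-shift split: try every split point in the AAS time series and keep the one with the largest gap between the mean before and the mean after. This is a swapped-in heuristic for illustration, not the method in workflow.md; the function name, the `(timestamp, value)` input shape, and the `min_segment` guard are assumptions.

```python
from statistics import mean
from typing import List, Optional, Tuple


def find_inflection(
    samples: List[Tuple[float, float]],
    min_segment: int = 3,
) -> Optional[float]:
    """Return the timestamp where the mean level changes the most.

    samples are (timestamp, value) pairs in time order. Returns None
    when the series is too short or shows no change in level.
    """
    if len(samples) < 2 * min_segment:
        return None
    best_index = None
    best_gap = 0.0
    for i in range(min_segment, len(samples) - min_segment + 1):
        before = mean([value for _, value in samples[:i]])
        after = mean([value for _, value in samples[i:]])
        gap = abs(after - before)
        if gap > best_gap:
            best_index, best_gap = i, gap
    if best_index is None:
        return None
    # The first sample of the "after" segment marks the change.
    return samples[best_index][0]
```

A series that sits at 1.0 for six samples and then at 3.0 for six more returns the timestamp of the seventh sample. Noisy production data would likely need smoothing or a proper change-point method on top of this.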

Adds 'Presenting Results' section mandating a unified report across
all dimensions before handoff to Workflow 9.
@Morlej Morlej marked this pull request as ready for review June 29, 2026 20:46
@Morlej Morlej requested review from a team as code owners June 29, 2026 20:46