# quality-playbook v1.2.0: state machine analysis and missing safeguard detection (#1238)
Add Step 5a (state machine completeness analysis) and expand Step 6
with missing safeguard detection patterns. These catch two categories
of bugs that defensive pattern analysis alone misses: unhandled states
in lifecycle/status machines, and operations that commit users to
expensive work without adequate preview or termination conditions.
**`docs/README.skills.md`** (+1 −1)

@@ -222,7 +222,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
|[publish-to-pages](../skills/publish-to-pages/SKILL.md)| Publish presentations and web content to GitHub Pages. Converts PPTX, PDF, HTML, or Google Slides to a live GitHub Pages URL. Handles repo creation, file conversion, Pages enablement, and returns the live URL. Use when the user wants to publish, deploy, or share a presentation or HTML file via GitHub Pages. |`scripts/convert-pdf.py`<br />`scripts/convert-pptx.py`<br />`scripts/publish.sh`|
|[pytest-coverage](../skills/pytest-coverage/SKILL.md)| Run pytest tests with coverage, discover lines missing coverage, and increase coverage to 100%. | None |
|[python-mcp-server-generator](../skills/python-mcp-server-generator/SKILL.md)| Generate a complete MCP server project in Python with tools, resources, and proper configuration | None |
-| [quality-playbook](../skills/quality-playbook/SKILL.md) | Explore any codebase from scratch and generate six quality artifacts: a quality constitution (QUALITY.md), spec-traced functional tests, a code review protocol with regression test generation, an integration testing protocol, a multi-model spec audit (Council of Three), and an AI bootstrap file (AGENTS.md). Works with any language (Python, Java, Scala, TypeScript, Go, Rust, etc.). Use this skill whenever the user asks to set up a quality playbook, generate functional tests from specifications, create a quality constitution, build testing protocols, audit code against specs, or establish a repeatable quality system for a project. Also trigger when the user mentions 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', 'coverage theater', or wants to go beyond basic test generation to build a full quality system grounded in their actual codebase. | `LICENSE.txt`<br />`references/constitution.md`<br />`references/defensive_patterns.md`<br />`references/functional_tests.md`<br />`references/review_protocols.md`<br />`references/schema_mapping.md`<br />`references/spec_audit.md`<br />`references/verification.md` |
+| [quality-playbook](../skills/quality-playbook/SKILL.md) | Explore any codebase from scratch and generate six quality artifacts: a quality constitution (QUALITY.md), spec-traced functional tests, a code review protocol with regression test generation, an integration testing protocol, a multi-model spec audit (Council of Three), and an AI bootstrap file (AGENTS.md). Includes state machine completeness analysis and missing safeguard detection. Works with any language (Python, Java, Scala, TypeScript, Go, Rust, etc.). Use this skill whenever the user asks to set up a quality playbook, generate functional tests from specifications, create a quality constitution, build testing protocols, audit code against specs, or establish a repeatable quality system for a project. Also trigger when the user mentions 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', 'coverage theater', or wants to go beyond basic test generation to build a full quality system grounded in their actual codebase. | `LICENSE.txt`<br />`references/constitution.md`<br />`references/defensive_patterns.md`<br />`references/functional_tests.md`<br />`references/review_protocols.md`<br />`references/schema_mapping.md`<br />`references/spec_audit.md`<br />`references/verification.md` |
|[quasi-coder](../skills/quasi-coder/SKILL.md)| Expert 10x engineer skill for interpreting and implementing code from shorthand, quasi-code, and natural language descriptions. Use when collaborators provide incomplete code snippets, pseudo-code, or descriptions with potential typos or incorrect terminology. Excels at translating non-technical or semi-technical descriptions into production-quality code. | None |
|[readme-blueprint-generator](../skills/readme-blueprint-generator/SKILL.md)| Intelligent README.md generation prompt that analyzes project documentation structure and creates comprehensive repository documentation. Scans .github/copilot directory files and copilot-instructions.md to extract project information, technology stack, architecture, development workflow, coding standards, and testing approaches while generating well-structured markdown documentation with proper formatting, cross-references, and developer-focused content. | None |
|[refactor](../skills/refactor/SKILL.md)| Surgical code refactoring to improve maintainability without changing behavior. Covers extracting functions, renaming variables, breaking down god functions, improving type safety, eliminating code smells, and applying design patterns. Less drastic than repo-rebuilder; use for gradual improvements. | None |
**`skills/quality-playbook/SKILL.md`** (+20 −3)
@@ -1,9 +1,9 @@
---
name: quality-playbook
-description: "Explore any codebase from scratch and generate six quality artifacts: a quality constitution (QUALITY.md), spec-traced functional tests, a code review protocol with regression test generation, an integration testing protocol, a multi-model spec audit (Council of Three), and an AI bootstrap file (AGENTS.md). Works with any language (Python, Java, Scala, TypeScript, Go, Rust, etc.). Use this skill whenever the user asks to set up a quality playbook, generate functional tests from specifications, create a quality constitution, build testing protocols, audit code against specs, or establish a repeatable quality system for a project. Also trigger when the user mentions 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', 'coverage theater', or wants to go beyond basic test generation to build a full quality system grounded in their actual codebase."
+description: "Explore any codebase from scratch and generate six quality artifacts: a quality constitution (QUALITY.md), spec-traced functional tests, a code review protocol with regression test generation, an integration testing protocol, a multi-model spec audit (Council of Three), and an AI bootstrap file (AGENTS.md). Includes state machine completeness analysis and missing safeguard detection. Works with any language (Python, Java, Scala, TypeScript, Go, Rust, etc.). Use this skill whenever the user asks to set up a quality playbook, generate functional tests from specifications, create a quality constitution, build testing protocols, audit code against specs, or establish a repeatable quality system for a project. Also trigger when the user mentions 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', 'coverage theater', or wants to go beyond basic test generation to build a full quality system grounded in their actual codebase."
license: Complete terms in LICENSE.txt
metadata:
-  version: 1.1.0
+  version: 1.2.0
  author: Andrew Stellman
  github: https://github.com/andrewstellman/
---
@@ -13,7 +13,7 @@ metadata:
**When this skill starts, display this banner before doing anything else:**
```
-Quality Playbook v1.1.0 — by Andrew Stellman
+Quality Playbook v1.2.0 — by Andrew Stellman
https://github.com/andrewstellman/
```
@@ -158,6 +158,21 @@ This is the most important step. Search for defensive code patterns — each one
Minimum bar: at least 2–3 defensive patterns per core source file. If you find fewer, you're skimming — read function bodies, not just signatures.
### Step 5a: Trace State Machines
If the project has any kind of state management — status fields, lifecycle phases, workflow stages, mode flags — trace the state machine completely. This catches a category of bugs that defensive pattern analysis alone misses: states that exist but aren't handled.
**How to find state machines:** Search for status/state fields in models, enums, or constants (e.g., `status`, `state`, `phase`, `mode`). Search for guards that check status before allowing actions (e.g., `if status == "running"`, `match self.state`). Search for state transitions (assignments to status fields).
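As a concrete starting point, here is a minimal sketch in Python of the searches this step describes. The field names and the `src` directory are generic assumptions, not taken from any particular project:

```python
import re
from pathlib import Path

# Candidate state-machine signals: field declarations, status guards,
# and state transitions. These names are common conventions, not
# project-specific identifiers.
PATTERNS = {
    "field": re.compile(r"\b(status|state|phase|mode)\s*[:=]"),
    "guard": re.compile(r"\bif\s+.*\b(status|state)\b\s*=="),
    "transition": re.compile(r"\b(status|state)\s*=\s*['\"]\w+['\"]"),
}

def find_state_machine_signals(root: str) -> None:
    """Print every line that looks like a state field, guard, or transition."""
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for kind, pattern in PATTERNS.items():
                if pattern.search(line):
                    print(f"{path}:{lineno} [{kind}] {line.strip()}")

find_state_machine_signals("src")
```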
**For each state machine you find** (a minimal sketch follows the list):
1. **Enumerate all possible states.** Read the enum, the constants, or grep for every value the field is assigned. List them all.
2. **For each consumer of state** (UI handlers, API endpoints, control flow guards), check: does it handle every possible state? A `switch`/`match` without a meaningful default, or an `if/elif` chain that doesn't cover all states, is a gap.
3. **For each state transition**, check: can you reach every state? Are there states you can enter but never leave? Are there states that block operations that should be available?
4. **Record gaps as findings.** A status guard that allows action X for "running" but not for "stuck" is a real bug if the user needs to perform action X on stuck processes. A process that enters a terminal state but never triggers cleanup is a real bug.
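Here is a minimal sketch of the gap that step 2 catches, using hypothetical state names (nothing here comes from a specific project). It assumes Python 3.10+ for `match`:

```python
from enum import Enum

class RunStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    STUCK = "stuck"        # reachable, e.g. when a watchdog marks a stalled run
    COMPLETE = "complete"

def can_kill(status: RunStatus) -> bool:
    # The gap: only RUNNING is handled, so a run in STUCK status, which is
    # exactly the one the user most needs to kill, is silently blocked.
    return status == RunStatus.RUNNING

def can_kill_fixed(status: RunStatus) -> bool:
    # Exhaustive handling: every state gets an explicit decision.
    match status:
        case RunStatus.RUNNING | RunStatus.STUCK:
            return True
        case RunStatus.PENDING | RunStatus.COMPLETE:
            return False

assert can_kill(RunStatus.STUCK) is False       # the unhandled-state bug
assert can_kill_fixed(RunStatus.STUCK) is True  # the exhaustive fix
```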
**Why this matters:** State machine gaps produce bugs that are invisible during normal operation but surface under stress or edge conditions — exactly when you need the system to work. A batch processor that can't be killed when it's in "stuck" status, a watcher that never self-terminates after all work completes, and a UI that refuses to resume a "pending" run are all symptoms of incomplete state handling. These bugs don't show up in defensive pattern analysis because the code isn't defending against them — it's simply not handling them at all.
### Step 5b: Map Schema Types
If the project has a validation layer (Pydantic models in Python, JSON Schema, TypeScript interfaces/Zod schemas, Java Bean Validation annotations, Scala case class codecs), read the schema definitions now. For every field you found a defensive pattern for, record what the schema accepts vs. rejects.
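For example, here is a minimal sketch of the accepts-vs-rejects record for a hypothetical validation layer. It assumes Pydantic v2; the model and field names are invented for illustration:

```python
from pydantic import BaseModel, Field, ValidationError

class BatchJob(BaseModel):
    name: str = Field(min_length=1)           # rejects the empty string
    chunk_size: int = Field(gt=0, le=10_000)  # accepts 1..10_000 only

BatchJob(name="nightly", chunk_size=500)      # accepted
try:
    BatchJob(name="", chunk_size=0)           # rejected on both fields
except ValidationError as exc:
    # Record each rejection as a boundary to test: "", 0, -1, 10_001, ...
    print(len(exc.errors()), "schema rejections")
```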
@@ -179,6 +194,8 @@ Every project has a different failure profile. This step uses **two sources**
- "What produces correct-looking output that is actually wrong?" — This is the most dangerous class of bug: output that passes all checks but is subtly corrupted.
- "What happens at 10x scale that doesn't happen at 1x?" — Chunk boundaries, rate limits, timeout cascading, memory pressure.
- "What happens when this process is killed at the worst possible moment?" — Mid-write, mid-transaction, mid-batch-submission.
- "What information does the user need before committing to an irreversible or expensive operation?" — Pre-run cost estimates, confirmation of scope (especially when fan-out or expansion will multiply the work), resource warnings. If the system can silently commit the user to hours of processing or significant cost without showing them what they're about to do, that's a missing safeguard. Search for operations that start long-running processes, submit batch jobs, or trigger expansion/fan-out — and check whether the user sees a preview, estimate, or confirmation with real numbers before the point of no return.
- "What happens when a long-running process finishes — does it actually stop?" — Polling loops, watchers, background threads, and daemon processes that run until completion should have explicit termination conditions. If the loop checks "is there more work?" but never checks "is all work done?", it will run forever after completion. This is especially common in batch processors and queue consumers.
Generate realistic failure scenarios from this knowledge. You don't need to have observed these failures — you know from training that they happen to systems of this type. Write them as **architectural vulnerability analyses** with specific quantities and consequences. Frame each as "this architecture permits the following failure mode" — not as a fabricated incident report. Use concrete numbers to make the severity non-negotiable: "If the process crashes mid-write during a 10,000-record batch, `save_state()` without an atomic rename pattern will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume without manual intervention." Then ground them in the actual code you explored: "Read persistence.py line ~340 (save_state): verify temp file + rename pattern."
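As a minimal sketch, the temp-file-plus-rename pattern referenced above might look like this in Python (the file name and function signature are illustrative, not taken from the project):

```python
import json
import os
import tempfile

def save_state(state: dict, path: str = "state.json") -> None:
    # Write to a temp file in the same directory, then atomically rename.
    # A crash mid-write leaves the old file intact instead of a truncated
    # JSON document that fails with JSONDecodeError on the next run.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```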
State machines are a special category of defensive pattern. When you find status fields, lifecycle phases, or mode flags, trace the full state machine — see SKILL.md Step 5a for the complete process.
1. List every possible state value (read the enum or grep for assignments)
2. For each handler/consumer that checks state, verify it handles ALL states
3. Look for states you can enter but never leave (terminal state without cleanup)
4. Look for operations that should be available in a state but are blocked by an incomplete guard
**Converting state machine gaps to scenarios:**
```markdown
### Scenario N: [Status] blocks [operation]
**Requirement tag:** [Req: inferred — from handler() status guard]
**What happened:** The [handler] only allows [operation] when status is "[allowed_states]", but the system can enter "[missing_state]" status (e.g., due to [condition]). When this happens, the user cannot [operation] and has no workaround through the interface.
**The requirement:** [operation] must be available in all states where the user would reasonably need it, including [missing_state].
**How to verify:** Set up a [entity] in "[missing_state]" status. Attempt [operation]. Assert it succeeds or provides a clear error with a workaround.
```
## Missing Safeguard Patterns
Search for operations that commit the user to expensive, irreversible, or long-running work without adequate preview or confirmation (a sketch of one fix follows the table):
| Pattern | What to look for |
|---|---|
| Pre-commit information gap | Operations that start batch jobs, fan-out expansions, or API calls without showing estimated cost, scope, or duration |
| Silent expansion | Fan-out or multiplication steps where the final work count isn't known until runtime, with no warning shown |
| No termination condition | Polling loops, watchers, or daemon processes that check for new work but never check whether all work is done |
| Retry without backoff | Error handling that retries immediately or on a fixed interval without exponential backoff, risking rate limit floods |
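As a minimal sketch of closing the first gap in this table, a pre-commit preview might look like the following. Every name and number here is a hypothetical stand-in:

```python
def confirm_fanout(items: list, expansions_per_item: int, cost_per_unit: float) -> bool:
    """Show the real numbers before the point of no return.

    All parameters are hypothetical stand-ins for the project's own
    batch-submission API; the point is the preview, not the names.
    """
    total_units = len(items) * expansions_per_item
    print(f"{len(items)} items will fan out to {total_units} work units")
    print(f"Estimated cost: ${total_units * cost_per_unit:,.2f}")
    return input("Proceed? [y/N] ").strip().lower() == "y"

# Guard the expensive operation behind the preview:
# if confirm_fanout(items, expansions_per_item=40, cost_per_unit=0.02):
#     submit_batch(items)
```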
**Converting missing safeguards to scenarios:**
```markdown
### Scenario N: No [safeguard] before [operation]
**Requirement tag:** [Req: inferred — from init_run()/start_watch() behavior]
**What happened:** [Operation] commits the user to [consequence] without showing [missing information]. In practice, a [example] fanned out from [small number] to [large number] units with no warning, resulting in [cost/time consequence].
**The requirement:** Before committing to [operation], display [safeguard] showing [what the user needs to see].
**How to verify:** Initiate [operation] and assert that [safeguard information] is displayed before the point of no return.
```
## Minimum Bar
You should find at least 2–3 defensive patterns per source file in the core logic modules. If you find fewer, read function bodies more carefully — not just signatures and comments.
-For a medium-sized project (5–15 source files), expect to find 15–30 defensive patterns total. Each one should produce at least one boundary test.
+For a medium-sized project (5–15 source files), expect to find 15–30 defensive patterns total. Each one should produce at least one boundary test. Additionally, trace at least one state machine if the project has status/state fields, and check at least one long-running operation for missing safeguards.