# Writing the Quality Constitution (File 1: QUALITY.md)

The quality constitution defines what "quality" means for this specific project and makes the bar explicit, persistent, and inherited by every AI session.

## Template

```markdown
# Quality Constitution: [Project Name]

## Purpose

[2–3 paragraphs grounding quality in three principles:]

- **Deming** ("quality is built in, not inspected in") — Quality is built into context files
  and the quality playbook so every AI session inherits the same bar.
- **Juran** ("fitness for use") — Define fitness specifically for this project. Not "tests pass"
  but the actual real-world requirement. Example: "generates correct output that survives
  input schema changes without silently producing wrong results."
- **Crosby** ("quality is free") — Building a quality playbook upfront costs less than
  debugging problems found after deployment.

## Coverage Targets

| Subsystem | Target | Why |
|-----------|--------|-----|
| [Most fragile module] | 90–95% | [Real edge case or past bug] |
| [Core logic module] | 85–90% | [Concrete risk] |
| [I/O or integration layer] | 80% | [Explain] |
| [Configuration/utilities] | 75–80% | [Explain] |

The rationale column is essential. It must reference specific risks or past failures.
If you can't explain why a subsystem needs high coverage with a concrete example,
the target is arbitrary.

## Coverage Theater Prevention

[Define what constitutes a fake test for this project.]

Generic examples that apply to most projects:
- Asserting a function returned *something* without checking what
- Testing with synthetic data that lacks the quirks of real data
- Asserting an import succeeded
- Asserting a mock returned what it was configured to return
- Calling a function and only asserting no exception was thrown

[Add project-specific examples based on what you learned during exploration.
For a data pipeline: "counting output records without checking their values."
For a web app: "checking HTTP 200 without checking the response body."
For a compiler: "checking output compiles without checking behavior."]

## Fitness-to-Purpose Scenarios

[5–10 scenarios. Every scenario must include a `[Req: tier — source]` tag linking it to its requirement source. Use the template below:]

### Scenario N: [Memorable Name]

**Requirement tag:** [Req: formal — Spec §X] *(or `user-confirmed` / `inferred` — see SKILL.md Phase 1, Step 1 for tier definitions)*

**What happened:** [The architectural vulnerability, edge case, or design decision.
Reference actual code — function names, file names, line numbers. Frame as "this architecture permits the following failure mode."]

**The requirement:** [What the code must do to prevent this failure.
Be specific enough that an AI can verify it.]

**How to verify:** [Concrete test or query that would fail if this regressed.
Include exact commands, test names, or assertions.]

---

[Repeat for each scenario]

## AI Session Quality Discipline

1. Read QUALITY.md before starting work.
2. Run the full test suite before marking any task complete.
3. Add tests for new functionality (not just happy path — include edge cases).
4. Update this file if new failure modes are discovered.
5. Output a Quality Compliance Checklist before ending a session.
6. Never remove a fitness-to-purpose scenario. Only add new ones.

## The Human Gate

[List things that require human judgment:]
- Output that "looks right" (requires domain knowledge)
- UX and responsiveness
- Documentation accuracy
- Security review of auth changes
- Backward compatibility decisions
```
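
To make the "Coverage Theater Prevention" section of the template concrete, here is a minimal pytest sketch contrasting a fake test with a meaningful one. The module and function names (`pipeline.normalize`, `normalize_record`) are hypothetical; the pattern is the point: both tests execute the same code and earn the same coverage, but only the second would catch a regression.

```python
# Hypothetical names, not from any particular project.
from pipeline.normalize import normalize_record


def test_normalize_record_theater():
    # Coverage theater: the code path runs, but nothing about the result is checked.
    result = normalize_record({"id": "42", "amount": "10,00"})
    assert result is not None


def test_normalize_record_meaningful():
    # Fitness for use: a real-world quirk (comma as decimal separator) plus exact expectations.
    result = normalize_record({"id": "42", "amount": "10,00"})
    assert result["id"] == 42
    assert result["amount"] == 10.0
```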

## Where Scenarios Come From

Scenarios come from two sources — **code exploration** and **domain knowledge** — and the best scenarios combine both.

### Source 1: Defensive Code Patterns (Code Exploration)

Every defensive pattern is evidence of a past failure or known risk:

1. **Defensive code** — Every `if value is None: return` guard is a scenario. Why was it needed? (See the sketch after this list.)
2. **Normalization functions** — Every function that cleans input exists because raw input caused problems
3. **Configuration that could be hardcoded** — If a value is read from config instead of being hardcoded, someone learned the value varies
4. **Git blame / commit messages** — "Fix crash when X is missing" → Scenario: X can be missing
5. **Comments explaining "why"** — "We use hash(id) not sequential index because..." → Scenario about correctness under that constraint
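
As a concrete instance of pattern 1, the sketch below shows the kind of guard to interrogate. The function and field names are made up; the reading technique is what matters: every guard encodes something that once went wrong, and that something is a scenario.

```python
# Illustrative only: a defensive guard that silently encodes a scenario.
def total_amount(records: list[dict]) -> float:
    total = 0.0
    for record in records:
        # Why is this guard here? Presumably a feed once delivered a null "amount".
        # That history is a fitness-to-purpose scenario: "records with missing amounts
        # must be surfaced as anomalies, not silently skipped."
        if record.get("amount") is None:
            continue
        total += record["amount"]
    return total
```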

### Source 2: What Could Go Wrong (Domain Knowledge)

Don't limit yourself to what the code already defends against. Use your knowledge of similar systems to generate realistic failure scenarios that the code **should** handle. For every major subsystem, ask:

- "What happens if this process is killed mid-operation?" (state machines, file I/O, batch processing)
- "What happens if external input is subtly wrong?" (validation pipelines, API integrations)
- "What happens if this runs at 10x scale?" (batch processing, databases, queues)
- "What happens if two operations overlap?" (concurrency, file locks, shared state)
- "What produces correct-looking output that is actually wrong?" (randomness, statistical operations, type coercion)

These are not hypothetical — they are things that happen to every system of this type. Write them as **architectural vulnerability analyses**: "Because `save_state()` lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume without manual intervention. At scale (9,240 records across 64 batches), this pattern risks silent loss of 1,693+ records with nothing to flag them as missing." Concrete numbers and specific consequences make scenarios authoritative and non-negotiable. An AI session reading "records can be lost" will argue the standard down. An AI session reading a specific failure mode with quantified impact will not.

### The Narrative Voice

Each scenario's "What happened" must read like an architectural vulnerability analysis, not an abstract specification. Include:

- **Specific quantities** — "308 records across 64 batches" not "some records"
- **Cascade consequences** — "cascading through all subsequent pipeline steps, requiring reprocessing of 4,300 records instead of 308"
- **Detection difficulty** — "nothing would flag them as missing" or "only statistical verification would catch it"
- **Root cause in code** — "`random.seed(index)` creates correlated sequences because sequential integers produce related random streams"

The narrative voice serves a critical purpose: it makes standards non-negotiable. Abstract requirements ("records should not be lost") invite rationalization. Specific failure modes with quantified impact ("a mid-batch crash silently loses 1,693 records with no detection mechanism") do not. Frame these as "this architecture permits the following failure" — grounded in the actual code, not fabricated as past incidents.

### Combining Both Sources

The strongest scenarios combine a defensive pattern found in code with domain knowledge about why it matters:

1. Find the defensive code: `save_state()` writes to a temp file then renames (see the sketch after these steps)
2. Ask what failure this prevents: a mid-write crash leaves a corrupted state file
3. Write the scenario as a vulnerability analysis: "Without the atomic rename pattern, a crash mid-write leaves state.json 50% complete. The next run gets JSONDecodeError and cannot resume without manual intervention."
4. Ground it in code: "Read persistence.py line ~340: verify temp file + rename pattern"
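
For reference, here is a minimal sketch of what the temp-file-plus-rename pattern in step 1 might look like, assuming a JSON state file. The function name mirrors the example above, but the details are illustrative rather than lifted from a real persistence.py.

```python
import json
import os
import tempfile


def save_state(state: dict, path: str = "state.json") -> None:
    """Illustrative atomic-write sketch: a crash mid-write never corrupts `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new state to a temp file in the same directory (same filesystem),
    # so the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # force bytes to disk before the rename
        os.replace(tmp_path, path)  # readers see the old file or the new one, never half of each
    except BaseException:
        os.remove(tmp_path)
        raise
```

A scenario grounded in this pattern then has an obvious "How to verify": interrupt the write and assert that the previous state still parses (see the test sketch under "Critical Rule" below).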

### The "Why" Requirement

Every coverage target, every quality gate, every standard must have a "why" that references a specific scenario or risk. Without rationale, a future AI session will optimize for speed and argue the standard down.

Bad: "Core logic: 100% coverage"
Good: "Core logic: 100% — because `random.seed(index)` created correlated sequences that produced 77.5% bias instead of 50/50. Subtle bugs here produce plausible-but-wrong output. Only statistical verification catches them."

The "why" is not documentation — it is protection against erosion.

## Calibrating Scenario Count

Aim for 2+ scenarios per core module (the modules identified as most complex or fragile). For a medium-sized project, this typically yields 8–10 scenarios. Fewer is fine for small projects; more for complex ones. If you're finding very few scenarios, it usually means the exploration was shallow rather than the project being simple — go back and read function bodies more carefully. Quality matters more than count: one scenario that precisely captures an architectural vulnerability is worth more than three generic "what if the input is bad" scenarios.

## Self-Critique Before Finishing

After drafting all scenarios, review each one and ask:

1. **"Would an AI session argue this standard down?"** If yes, the "why" isn't concrete enough. Add numbers, consequences, and detection difficulty.
2. **"Does the 'What happened' read like a vulnerability analysis or an abstract spec?"** If it reads like a spec, rewrite it with specific quantities, cascading consequences, and grounding in actual code.
3. **"Is there a scenario I'm not seeing?"** Think about what a different AI model would flag. Architecture models catch data flow problems. Edge-case models catch boundary conditions. What are you blind to?

## Critical Rule

Each scenario's "How to verify" section must map to at least one automated test in the functional test file. If a scenario can't be automated, note why (it may require the Human Gate) — but most scenarios should be testable.
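
For illustration, the sketch below turns the mid-write-crash scenario used throughout this page into a single automated test. `save_state` and its module path are hypothetical (and the test assumes it serializes via `json.dump`, as in the earlier sketch); the transferable part is the naming convention and the docstring that quotes the scenario and its requirement tag.

```python
import json

import pytest

from mypackage.persistence import save_state  # hypothetical module under test


def test_scenario_mid_write_crash_preserves_previous_state(tmp_path, monkeypatch):
    """Scenario: Mid-Write Crash [Req: inferred]. A crash during save must not corrupt state.json."""
    state_file = tmp_path / "state.json"
    state_file.write_text(json.dumps({"processed": 308}))

    def exploding_dump(obj, handle, **kwargs):
        handle.write('{"processed"')  # half-written payload, then the simulated crash
        raise OSError("simulated crash mid-write")

    monkeypatch.setattr(json, "dump", exploding_dump)
    with pytest.raises(OSError):
        save_state({"processed": 4300}, path=str(state_file))

    # If the atomic-rename pattern regresses, this assertion fails:
    # the previous state must still parse after the simulated crash.
    assert json.loads(state_file.read_text()) == {"processed": 308}
```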