Commit 50f87bd: Add quality-playbook skill (#1168)

1 parent 2520565 commit 50f87bd

10 files changed: 2081 additions & 0 deletions

docs/README.skills.md

Lines changed: 1 addition & 0 deletions
@@ -218,6 +218,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [publish-to-pages](../skills/publish-to-pages/SKILL.md) | Publish presentations and web content to GitHub Pages. Converts PPTX, PDF, HTML, or Google Slides to a live GitHub Pages URL. Handles repo creation, file conversion, Pages enablement, and returns the live URL. Use when the user wants to publish, deploy, or share a presentation or HTML file via GitHub Pages. | `scripts/convert-pdf.py`<br />`scripts/convert-pptx.py`<br />`scripts/publish.sh` |
| [pytest-coverage](../skills/pytest-coverage/SKILL.md) | Run pytest tests with coverage, discover lines missing coverage, and increase coverage to 100%. | None |
| [python-mcp-server-generator](../skills/python-mcp-server-generator/SKILL.md) | Generate a complete MCP server project in Python with tools, resources, and proper configuration | None |
| [quality-playbook](../skills/quality-playbook/SKILL.md) | Explore any codebase from scratch and generate six quality artifacts: a quality constitution (QUALITY.md), spec-traced functional tests, a code review protocol, an integration testing protocol, a multi-model spec audit (Council of Three), and an AI bootstrap file (AGENTS.md). Works with any language (Python, Java, Scala, TypeScript, Go, Rust, etc.). Use this skill whenever the user asks to set up a quality playbook, generate functional tests from specifications, create a quality constitution, build testing protocols, audit code against specs, or establish a repeatable quality system for a project. Also trigger when the user mentions 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', 'coverage theater', or wants to go beyond basic test generation to build a full quality system grounded in their actual codebase. | `LICENSE.txt`<br />`references/constitution.md`<br />`references/defensive_patterns.md`<br />`references/functional_tests.md`<br />`references/review_protocols.md`<br />`references/schema_mapping.md`<br />`references/spec_audit.md`<br />`references/verification.md` |
| [quasi-coder](../skills/quasi-coder/SKILL.md) | Expert 10x engineer skill for interpreting and implementing code from shorthand, quasi-code, and natural language descriptions. Use when collaborators provide incomplete code snippets, pseudo-code, or descriptions with potential typos or incorrect terminology. Excels at translating non-technical or semi-technical descriptions into production-quality code. | None |
| [readme-blueprint-generator](../skills/readme-blueprint-generator/SKILL.md) | Intelligent README.md generation prompt that analyzes project documentation structure and creates comprehensive repository documentation. Scans .github/copilot directory files and copilot-instructions.md to extract project information, technology stack, architecture, development workflow, coding standards, and testing approaches while generating well-structured markdown documentation with proper formatting, cross-references, and developer-focused content. | None |
| [refactor](../skills/refactor/SKILL.md) | Surgical code refactoring to improve maintainability without changing behavior. Covers extracting functions, renaming variables, breaking down god functions, improving type safety, eliminating code smells, and applying design patterns. Less drastic than repo-rebuilder; use for gradual improvements. | None |
skills/quality-playbook/LICENSE.txt

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Andrew Stellman

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

skills/quality-playbook/SKILL.md

Lines changed: 453 additions & 0 deletions
Large diffs are not rendered by default.
skills/quality-playbook/references/constitution.md

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
# Writing the Quality Constitution (File 1: QUALITY.md)

The quality constitution defines what "quality" means for this specific project and makes the bar explicit, persistent, and inherited by every AI session.

## Template

```markdown
# Quality Constitution: [Project Name]

## Purpose

[2–3 paragraphs grounding quality in three principles:]

- **Deming** ("quality is built in, not inspected in") — Quality is built into context files
  and the quality playbook so every AI session inherits the same bar.
- **Juran** ("fitness for use") — Define fitness specifically for this project. Not "tests pass"
  but the actual real-world requirement. Example: "generates correct output that survives
  input schema changes without silently producing wrong results."
- **Crosby** ("quality is free") — Building a quality playbook upfront costs less than
  debugging problems found after deployment.

## Coverage Targets

| Subsystem | Target | Why |
|-----------|--------|-----|
| [Most fragile module] | 90–95% | [Real edge case or past bug] |
| [Core logic module] | 85–90% | [Concrete risk] |
| [I/O or integration layer] | 80% | [Explain] |
| [Configuration/utilities] | 75–80% | [Explain] |

The rationale column is essential. It must reference specific risks or past failures.
If you can't explain why a subsystem needs high coverage with a concrete example,
the target is arbitrary.

## Coverage Theater Prevention

[Define what constitutes a fake test for this project.]

Generic examples that apply to most projects:
- Asserting a function returned *something* without checking what
- Testing with synthetic data that lacks the quirks of real data
- Asserting an import succeeded
- Asserting a mock returns what the mock was configured to return
- Calling a function and only asserting no exception was thrown

[Add project-specific examples based on what you learned during exploration.
For a data pipeline: "counting output records without checking their values."
For a web app: "checking HTTP 200 without checking the response body."
For a compiler: "checking output compiles without checking behavior."]

## Fitness-to-Purpose Scenarios

[5–10 scenarios. Every scenario must include a `[Req: tier — source]` tag linking it to its requirement source. Use the template below:]

### Scenario N: [Memorable Name]

**Requirement tag:** [Req: formal — Spec §X] *(or `user-confirmed` / `inferred` — see SKILL.md Phase 1, Step 1 for tier definitions)*

**What happened:** [The architectural vulnerability, edge case, or design decision.
Reference actual code — function names, file names, line numbers. Frame as "this architecture permits the following failure mode."]

**The requirement:** [What the code must do to prevent this failure.
Be specific enough that an AI can verify it.]

**How to verify:** [Concrete test or query that would fail if this regressed.
Include exact commands, test names, or assertions.]

---

[Repeat for each scenario]

## AI Session Quality Discipline

1. Read QUALITY.md before starting work.
2. Run the full test suite before marking any task complete.
3. Add tests for new functionality (not just happy path — include edge cases).
4. Update this file if new failure modes are discovered.
5. Output a Quality Compliance Checklist before ending a session.
6. Never remove a fitness-to-purpose scenario. Only add new ones.

## The Human Gate

[List things that require human judgment:]
- Output that "looks right" (requires domain knowledge)
- UX and responsiveness
- Documentation accuracy
- Security review of auth changes
- Backward compatibility decisions
```
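The "fake test" bullets in the template can be made concrete. Here is a hedged Python sketch contrasting a coverage-theater test with a meaningful one (all names are hypothetical, not taken from any particular project):

```python
from unittest.mock import Mock

def compute_total(items):
    """Hypothetical production code under test: sums (name, price) pairs."""
    return sum(price for _, price in items)

def test_theater():
    # Coverage theater: asserts the mock returns what the mock was told to return
    svc = Mock()
    svc.total.return_value = 42
    assert svc.total() == 42  # proves nothing about compute_total

def test_meaningful():
    # Meaningful: checks the actual computed value, not just "it returned something"
    assert compute_total([("apples", 2.0), ("tea", 3.5)]) == 5.5
```

Both tests produce identical coverage numbers for their lines; only the second one would fail if `compute_total` broke.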
## Where Scenarios Come From

Scenarios come from two sources — **code exploration** and **domain knowledge** — and the best scenarios combine both.

### Source 1: Defensive Code Patterns (Code Exploration)

Every defensive pattern is evidence of a past failure or known risk:

1. **Defensive code** — Every `if value is None: return` guard is a scenario. Why was it needed?
2. **Normalization functions** — Every function that cleans input exists because raw input caused problems
3. **Configuration that could be hardcoded** — If a value is read from config instead of hardcoded, someone learned the value varies
4. **Git blame / commit messages** — "Fix crash when X is missing" → Scenario: X can be missing
5. **Comments explaining "why"** — "We use hash(id) not sequential index because..." → Scenario about correctness under that constraint
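Point 4 can be run mechanically. A minimal sketch of mining commit messages for past failures (the throwaway repo path and commit message here are hypothetical; on a real project you would run the `git log` line in the project's own repo):

```shell
# Build a throwaway repo so the example is self-contained
rm -rf /tmp/blame-demo && mkdir -p /tmp/blame-demo && cd /tmp/blame-demo
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "Fix crash when user_id is missing"

# -i makes --grep case-insensitive; every "Fix ..." message is a candidate scenario
git log --oneline -i --grep="fix"
```

Each hit answers "what went wrong before?", which seeds a scenario such as "user_id can be missing."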
### Source 2: What Could Go Wrong (Domain Knowledge)

Don't limit yourself to what the code already defends against. Use your knowledge of similar systems to generate realistic failure scenarios that the code **should** handle. For every major subsystem, ask:

- "What happens if this process is killed mid-operation?" (state machines, file I/O, batch processing)
- "What happens if external input is subtly wrong?" (validation pipelines, API integrations)
- "What happens if this runs at 10x scale?" (batch processing, databases, queues)
- "What happens if two operations overlap?" (concurrency, file locks, shared state)
- "What produces correct-looking output that is actually wrong?" (randomness, statistical operations, type coercion)

These are not hypothetical — they are things that happen to every system of this type. Write them as **architectural vulnerability analyses**: "Because `save_state()` lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume without manual intervention. At scale (9,240 records across 64 batches), this pattern risks silent loss of 1,693+ records with nothing to flag them as missing." Concrete numbers and specific consequences make scenarios authoritative and non-negotiable. An AI session reading "records can be lost" will argue the standard down. An AI session reading a specific failure mode with quantified impact will not.

### The Narrative Voice

Each scenario's "What happened" must read like an architectural vulnerability analysis, not an abstract specification. Include:

- **Specific quantities** — "308 records across 64 batches" not "some records"
- **Cascade consequences** — "cascading through all subsequent pipeline steps, requiring reprocessing of 4,300 records instead of 308"
- **Detection difficulty** — "nothing would flag them as missing" or "only statistical verification would catch it"
- **Root cause in code** — "`random.seed(index)` creates correlated sequences because sequential integers produce related random streams"

The narrative voice serves a critical purpose: it makes standards non-negotiable. Abstract requirements ("records should not be lost") invite rationalization. Specific failure modes with quantified impact ("a mid-batch crash silently loses 1,693 records with no detection mechanism") do not. Frame these as "this architecture permits the following failure" — grounded in the actual code, not fabricated as past incidents.

### Combining Both Sources

The strongest scenarios combine a defensive pattern found in code with domain knowledge about why it matters:

1. Find the defensive code: `save_state()` writes to a temp file then renames
2. Ask what failure this prevents: mid-write crash leaves corrupted state file
3. Write the scenario as a vulnerability analysis: "Without the atomic rename pattern, a crash mid-write leaves state.json 50% complete. The next run gets JSONDecodeError and cannot resume without manual intervention."
4. Ground it in code: "Read persistence.py line ~340: verify temp file + rename pattern"
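The atomic rename pattern referenced in these steps can be sketched concretely. A minimal Python version, assuming JSON state (the `save_state()` signature is hypothetical, standing in for whatever persistence function the project has):

```python
import json
import os
import tempfile

def save_state(state, path):
    """Write state atomically: a crash mid-write never corrupts `path`."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes are on disk before the rename
        os.replace(tmp, path)     # atomic replacement on POSIX and Windows
    except BaseException:
        os.remove(tmp)            # clean up the partial temp file on failure
        raise
```

If the process dies before `os.replace`, the old file is untouched; readers never observe a half-written state file.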
### The "Why" Requirement

Every coverage target, every quality gate, every standard must have a "why" that references a specific scenario or risk. Without rationale, a future AI session will optimize for speed and argue the standard down.

Bad: "Core logic: 100% coverage"
Good: "Core logic: 100% — because `random.seed(index)` created correlated sequences that produced 77.5% bias instead of 50/50. Subtle bugs here produce plausible-but-wrong output. Only statistical verification catches them."

The "why" is not documentation — it is protection against erosion.
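A hedged sketch of what "statistical verification" can look like in practice. The 50/50 draw, sample size, and tolerance below are assumptions for illustration, not taken from any project; the point is that the test checks a distribution, not a single return value:

```python
import random

def draw_labels(n, seed=12345):
    """Hypothetical: assign each record a 50/50 label from ONE seeded stream."""
    rng = random.Random(seed)  # one stream for the run, not one seed per record
    return [rng.random() < 0.5 for _ in range(n)]

def test_labels_are_unbiased():
    labels = draw_labels(10_000)
    ratio = sum(labels) / len(labels)
    # A correlated per-record seeding bug would push this far outside tolerance
    assert 0.47 < ratio < 0.53, f"biased split: {ratio:.3f}"
```

A per-index reseeding bug can pass every value-level assertion while failing this distributional one.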
## Calibrating Scenario Count

Aim for 2+ scenarios per core module (the modules identified as most complex or fragile). For a medium-sized project, this typically yields 8–10 scenarios. Fewer is fine for small projects; more for complex ones. If you're finding very few scenarios, it usually means the exploration was shallow rather than the project being simple — go back and read function bodies more carefully. Quality matters more than count: one scenario that precisely captures an architectural vulnerability is worth more than three generic "what if the input is bad" scenarios.

## Self-Critique Before Finishing

After drafting all scenarios, review each one and ask:

1. **"Would an AI session argue this standard down?"** If yes, the "why" isn't concrete enough. Add numbers, consequences, and detection difficulty.
2. **"Does the 'What happened' read like a vulnerability analysis or an abstract spec?"** If it reads like a spec, rewrite it with specific quantities, cascading consequences, and grounding in actual code.
3. **"Is there a scenario I'm not seeing?"** Think about what a different AI model would flag. Architecture models catch data flow problems. Edge-case models catch boundary conditions. What are you blind to?

## Critical Rule

Each scenario's "How to verify" section must map to at least one automated test in the functional test file. If a scenario can't be automated, note why (it may require the Human Gate) — but most scenarios should be testable.
skills/quality-playbook/references/defensive_patterns.md

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# Finding Defensive Patterns (Step 5)

Defensive code patterns are evidence of past failures or known risks. Every null guard, try/catch, normalization function, and sentinel check exists because something went wrong — or because someone anticipated it would. Your job is to find these patterns systematically and convert them into fitness-to-purpose scenarios and boundary tests.

## Systematic Search

Don't skim — grep the codebase methodically. The exact patterns depend on the project's language. Here are common defensive-code indicators grouped by what they protect against:

**Null/nil guards:**

| Language | Grep pattern |
|---|---|
| Python | `None`, `is None`, `is not None` |
| Java | `null`, `Optional`, `Objects.requireNonNull` |
| Scala | `Option`, `None`, `.getOrElse`, `.isEmpty` |
| TypeScript | `undefined`, `null`, `??`, `?.` |
| Go | `== nil`, `!= nil`, `if err != nil` |
| Rust | `Option`, `unwrap`, `.is_none()`, `?` |

**Exception/error handling:**

| Language | Grep pattern |
|---|---|
| Python | `except`, `try:`, `raise` |
| Java | `catch`, `throws`, `try {` |
| Scala | `Try`, `catch`, `recover`, `Failure` |
| TypeScript | `catch`, `throw`, `.catch(` |
| Go | `if err != nil`, `errors.New`, `fmt.Errorf` |
| Rust | `Result`, `Err(`, `unwrap_or`, `match` |

**Internal/private helpers (often defensive):**

| Language | Grep pattern |
|---|---|
| Python | `def _`, `__` |
| Java/Scala | `private`, `protected` |
| TypeScript | `private`, `#` (private fields) |
| Go | lowercase function names (unexported) |
| Rust | `pub(crate)`, non-`pub` functions |

**Sentinel values, fallbacks, boundary checks:** Search for `== 0`, `< 0`, `default`, `fallback`, `else`, `match`, `switch` — these are language-agnostic.
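For example, a Python-flavored null-guard scan might look like this. The demo directory and file below are hypothetical so the commands are self-contained; real runs point `grep` at the project's source tree:

```shell
# Create a tiny sample file so the scan has something to find
mkdir -p /tmp/grep-demo
cat > /tmp/grep-demo/loader.py <<'EOF'
def load(value):
    if value is None:   # defensive guard: upstream sometimes sends nothing
        return None
    return value.strip()
EOF

# -r recurses into the tree, -n prints line numbers for grounding scenarios
grep -rn "is None" /tmp/grep-demo
```

Each hit is a candidate: the guard's line number goes straight into the scenario's "What happened" section.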
## What to Look For Beyond Grep

- **Bugs that were fixed** — Git history, TODO comments, workarounds, defensive code that checks for things that "shouldn't happen"
- **Design decisions** — Comments explaining "why" not just "what." Configuration that could have been hardcoded but isn't. Abstractions that exist for a reason.
- **External data quirks** — Any place the code normalizes, validates, or rejects input from an external system
- **Parsing functions** — Every parser (regex, string splitting, format detection) has failure modes. What happens with malformed input? Empty input? Unexpected types?
- **Boundary conditions** — Zero values, empty strings, maximum ranges, first/last elements, type boundaries

## Converting Findings to Scenarios

For each defensive pattern, ask: "What failure does this prevent? What input would trigger this code path?"

The answer becomes a fitness-to-purpose scenario:

```markdown
### Scenario N: [Memorable Name]

**Requirement tag:** [Req: inferred — from function_name() behavior] *(use the canonical `[Req: tier — source]` format from SKILL.md Phase 1, Step 1)*

**What happened:** [The failure mode this code prevents. Reference the actual function, file, and line. Frame as a vulnerability analysis, not a fabricated incident.]

**The requirement:** [What the code must do to prevent this failure.]

**How to verify:** [A concrete test that would fail if this regressed.]
```
## Converting Findings to Boundary Tests

Each defensive pattern also maps to a boundary test:

```python
# Python (pytest)
def test_defensive_pattern_name(fixture):
    """[Req: inferred — from function_name() guard] guards against X."""
    # Mutate fixture to trigger the defensive code path
    # Assert the system handles it gracefully
```
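As a concrete, runnable instance of the Python template above (the `normalize_price()` guard is hypothetical, standing in for whatever guard the systematic search surfaced):

```python
def normalize_price(raw):
    """Hypothetical production code with a defensive None guard."""
    if raw is None:             # the guard found via grep
        return 0.0
    return round(float(raw), 2)

def test_normalize_price_none_guard():
    """[Req: inferred — from normalize_price() guard] guards against None."""
    assert normalize_price(None) == 0.0       # graceful fallback, not a TypeError
    assert normalize_price("19.99") == 19.99  # normal path still works
```

Note the test checks the fallback's actual value; asserting only "no exception was thrown" would be coverage theater.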
```java
// Java (JUnit 5)
@Test
@DisplayName("[Req: inferred — from methodName() guard] guards against X")
void testDefensivePatternName() {
    fixture.setField(null); // Trigger defensive code path
    var result = process(fixture);
    assertNotNull(result); // Assert graceful handling
}
```
```scala
// Scala (ScalaTest)
// [Req: inferred — from methodName() guard]
"defensive pattern: methodName()" should "guard against X" in {
  val input = fixture.copy(field = None) // Trigger defensive code path
  val result = process(input)
  result shouldBe defined // Assert graceful handling (the Option is a Some)
}
```
```typescript
// TypeScript (Jest)
test('[Req: inferred — from functionName() guard] guards against X', () => {
  const input = { ...fixture, field: null }; // Trigger defensive code path
  const result = process(input);
  expect(result).toBeDefined(); // Assert graceful handling
});
```
```go
// Go (testing)
func TestDefensivePatternName(t *testing.T) {
	// [Req: inferred — from FunctionName() guard] guards against X
	fixture.Field = nil // Trigger defensive code path
	result, err := Process(fixture)
	if err != nil {
		t.Fatalf("expected graceful handling, got error: %v", err)
	}
	if result == nil {
		t.Fatal("expected a usable result after the guard fired")
	}
}
```
```rust
// Rust (cargo test)
#[test]
fn test_defensive_pattern_name() {
    // [Req: inferred — from function_name() guard] guards against X
    let input = Fixture { field: None, ..default_fixture() };
    let result = process(&input);
    assert!(result.is_ok(), "expected graceful handling");
}
```

## Minimum Bar

You should find at least 2–3 defensive patterns per source file in the core logic modules. If you find fewer, read function bodies more carefully — not just signatures and comments.

For a medium-sized project (5–15 source files), expect to find 15–30 defensive patterns total. Each one should produce at least one boundary test.
