Commit 5f59ddb

update eval-driven-dev skill (#1352)
* update eval-driven-dev skill
* small refinement of skill description
* address review, rerun npm start.
1 parent 88b1920 commit 5f59ddb

19 files changed

Lines changed: 2179 additions & 1707 deletions

docs/README.skills.md

Lines changed: 1 addition & 1 deletion
@@ -131,7 +131,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
 | [ef-core](../skills/ef-core/SKILL.md) | Get best practices for Entity Framework Core | None |
 | [email-drafter](../skills/email-drafter/SKILL.md) | Draft and review professional emails that match your personal writing style. Analyzes your sent emails for tone, greeting, structure, and sign-off patterns via WorkIQ, then generates context-aware drafts for any recipient. USE FOR: draft email, write email, compose email, reply email, follow-up email, analyze email tone, email style. | None |
 | [entra-agent-user](../skills/entra-agent-user/SKILL.md) | Create Agent Users in Microsoft Entra ID from Agent Identities, enabling AI agents to act as digital workers with user identity capabilities in Microsoft 365 and Azure environments. | None |
-| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/dataset-generation.md`<br />`references/eval-tests.md`<br />`references/instrumentation.md`<br />`references/investigation.md`<br />`references/pixie-api.md`<br />`references/run-harness-patterns.md`<br />`references/understanding-app.md` |
+| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/1-a-entry-point.md`<br />`references/1-b-eval-criteria.md`<br />`references/2-wrap-and-trace.md`<br />`references/3-define-evaluators.md`<br />`references/4-build-dataset.md`<br />`references/5-run-tests.md`<br />`references/6-investigate.md`<br />`references/evaluators.md`<br />`references/testing-api.md`<br />`references/wrap-api.md`<br />`resources` |
 | [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md`<br />`references/excalidraw-schema.md`<br />`scripts/.gitignore`<br />`scripts/README.md`<br />`scripts/add-arrow.py`<br />`scripts/add-icon-to-diagram.py`<br />`scripts/split-excalidraw-library.py`<br />`templates` |
 | [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices. | `references/getdata.md`<br />`references/pyspark.md` |
 | [fedora-linux-triage](../skills/fedora-linux-triage/SKILL.md) | Triage and resolve Fedora issues with dnf, systemd, and SELinux-aware guidance. | None |

skills/eval-driven-dev/SKILL.md

Lines changed: 56 additions & 287 deletions
Large diffs are not rendered by default.

skills/eval-driven-dev/references/1-a-entry-point.md
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
# Step 1a: Entry Point & Execution Flow

Identify how the application starts and how a real user invokes it.

---

## What to investigate

### 1. How the software runs

What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?

Look for:

- `if __name__ == "__main__"` blocks
- Framework entry points (FastAPI `app`, Flask `app`, Django `manage.py`)
- CLI entry points in `pyproject.toml` (`[project.scripts]`)
- Docker/compose configs that reveal startup commands
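As a minimal sketch, the kind of entry point this step should surface might look like the following (all names here are hypothetical, not from any particular app):

```python
# Hypothetical minimal CLI entry point: the pattern to look for when tracing
# how the software runs.
import argparse


def answer(query: str) -> str:
    # Placeholder standing in for the real LLM call.
    return f"echo: {query}"


def main(argv=None) -> str:
    parser = argparse.ArgumentParser(prog="myapp")
    parser.add_argument("--query", required=True)
    args = parser.parse_args(argv)
    result = answer(args.query)
    print(result)
    return result


if __name__ == "__main__":
    main()
```

Finding this block tells you both the start command (`python myapp.py --query "..."`) and the required argument surface.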
### 2. The real user entry point

How does a real user or client invoke the app? This is what the eval must exercise — not an inner function that bypasses the request pipeline.

- **Web server**: Which HTTP endpoints accept user input? What methods (GET/POST)? What request body shape?
- **CLI**: What command-line arguments does the user provide?
- **Library/function**: What function does the caller import and call? What arguments?
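The distinction can be sketched as follows (hypothetical names; a stand-in for any request handler, not a real app):

```python
# Sketch: evals must exercise the public entry point, not an inner function
# that bypasses the request pipeline.


def _build_prompt(query: str) -> str:
    # Inner helper: calling this directly skips input validation entirely.
    return f"User asked: {query}"


def handle_request(body: dict) -> dict:
    # Public entry: validates input exactly as a real client experiences it.
    if "query" not in body:
        return {"error": "missing query"}
    return {"answer": _build_prompt(body["query"])}
```

An eval that calls `_build_prompt` directly would never observe the `missing query` error path, so it would not measure what users actually see.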
27+
28+
### 3. Environment and configuration
29+
30+
- What env vars does the app require? (API keys, database URLs, feature flags)
31+
- What config files does it read?
32+
- What has sensible defaults vs. what must be explicitly set?
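One way to make the defaults-vs-required split explicit in code (variable names here are hypothetical examples, not taken from any app):

```python
import os


def required_env(name: str) -> str:
    # Fail fast with a clear message when a must-set variable is missing,
    # rather than surfacing a KeyError deep inside the app.
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"set {name} before starting the app")
    return value


# Has a sensible default: safe to omit when running locally.
DB_URL = os.environ.get("DB_URL", "sqlite:///local.db")
```

Variables read via `os.environ.get(..., default)` go in the "has default" column of your findings; anything read unconditionally must be documented as required.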
---

## Output: `pixie_qa/01-entry-point.md`

Write your findings to this file. Keep it focused — only entry point and execution flow.

### Template

```markdown
# Entry Point & Execution Flow

## How to run

<Command to start the app, required env vars, config files>

## Entry point

- **File**: <e.g., app.py, main.py>
- **Type**: <FastAPI server / CLI / standalone function / etc.>
- **Framework**: <FastAPI, Flask, Django, none>

## User-facing endpoints / interface

<For each way a user interacts with the app:>

- **Endpoint / command**: <e.g., POST /chat, python main.py --query "...">
- **Input format**: <request body shape, CLI args, function params>
- **Output format**: <response shape, stdout format, return type>

## Environment requirements

| Variable | Purpose | Required? | Default |
| -------- | ------- | --------- | ------- |
| ... | ... | ... | ... |
```

skills/eval-driven-dev/references/1-b-eval-criteria.md
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
# Step 1b: Eval Criteria

Define what quality dimensions matter for this app — based on the entry point (`01-entry-point.md`) you've already documented.

This document serves two purposes:

1. **Dataset creation (Step 4)**: The use cases tell you what kinds of items to generate — each use case should have representative items in the dataset.
2. **Evaluator selection (Step 3)**: The eval criteria tell you what evaluators to choose and how to map them.

Keep this concise — it's a planning artifact, not a comprehensive spec.

---

## What to define

### 1. Use cases

List the distinct scenarios the app handles. Each use case becomes a category of dataset items. **Each use case description must be a concise one-liner that conveys both (a) what the input is and (b) what the expected behavior or outcome is.** The description should be specific enough that someone unfamiliar with the app can understand the scenario and its success criteria.

**Good use case descriptions:**

- "Reroute to human agent on account lookup difficulties"
- "Answer billing question using customer's plan details from CRM"
- "Decline to answer questions outside the support domain"
- "Summarize research findings including all queried sub-topics"

**Bad use case descriptions (too vague):**

- "Handle billing questions"
- "Edge case"
- "Error handling"

### 2. Eval criteria

Define **high-level, application-specific eval criteria** — quality dimensions that matter for THIS app. Each criterion will map to an evaluator in Step 3.

**Good criteria are specific to the app's purpose.** Examples:

- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation?"
- Research report generator: "Does the report address all sub-questions?", "Are claims supported by retrieved sources?"
- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when context is missing?"

**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.

At this stage, don't pick evaluator classes or thresholds. That comes in Step 3.

### 3. Check criteria applicability and observability

For each criterion:

1. **Determine applicability scope** — does this criterion apply to ALL use cases, or only a subset? If a criterion is only relevant for certain scenarios (e.g., "identity verification" only applies to account-related requests, not general FAQ), mark it clearly. This distinction is critical for Step 4 (dataset creation) because:
   - **Universal criteria** → become dataset-level default evaluators
   - **Case-specific criteria** → become item-level evaluators on relevant rows only

2. **Verify observability** — for each criterion, identify what data point in the app needs to be captured as a `wrap()` call to evaluate it. This drives the wrap coverage in Step 2.
   - If the criterion is about the app's final response → captured by `wrap(purpose="output", name="response")`
   - If it's about a routing decision → captured by `wrap(purpose="state", name="routing_decision")`
   - If it's about data the app fetched and used → captured by `wrap(purpose="input", name="...")`
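The capture mapping can be sketched with a stub. Note this is illustration only: the real `wrap()` comes from the Pixie API and traces values for evaluators, whereas this stand-in merely records which data points get captured:

```python
# Stand-in for the Pixie wrap() API -- a sketch, not the real tracer.
captured = {}


def wrap(value, *, purpose: str, name: str):
    # The real API would emit a trace; here we just record the data point
    # so the criterion-to-capture mapping is visible.
    captured[(purpose, name)] = value
    return value


# Final response -> checked by universal (dataset-level) criteria.
response = wrap("Your plan renews on March 1.", purpose="output", name="response")

# Routing decision -> checked only by case-specific (item-level) criteria.
route = wrap("human_agent", purpose="state", name="routing_decision")
```

If a criterion has no corresponding `wrap()` capture, it cannot be evaluated, which is exactly the gap this check is meant to expose before Step 2.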
---

## Output: `pixie_qa/02-eval-criteria.md`

Write your findings to this file. **Keep it short** — the template below is the maximum length.

### Template

```markdown
# Eval Criteria

## Use cases

1. <Use case name>: <one-liner conveying input + expected behavior>
2. ...

## Eval criteria

| # | Criterion | Applies to | Data to capture |
| --- | --------- | ------------- | --------------- |
| 1 | ... | All | wrap name: ... |
| 2 | ... | Use case 1, 3 | wrap name: ... |
```
