
Commit df0ed6a

yiouli and Copilot authored
update eval-driven-dev skill. (#1201)
* update eval-driven-dev skill. Split SKILL into multi-level to keep the skill body under 500 lines, rewrite instructions.
* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent ac30511 commit df0ed6a

9 files changed

Lines changed: 1659 additions & 803 deletions

File tree

docs/README.skills.md

Lines changed: 1 addition & 1 deletion
@@ -120,7 +120,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [ef-core](../skills/ef-core/SKILL.md) | Get best practices for Entity Framework Core | None |
| [email-drafter](../skills/email-drafter/SKILL.md) | Draft and review professional emails that match your personal writing style. Analyzes your sent emails for tone, greeting, structure, and sign-off patterns via WorkIQ, then generates context-aware drafts for any recipient. USE FOR: draft email, write email, compose email, reply email, follow-up email, analyze email tone, email style. | None |
| [entra-agent-user](../skills/entra-agent-user/SKILL.md) | Create Agent Users in Microsoft Entra ID from Agent Identities, enabling AI agents to act as digital workers with user identity capabilities in Microsoft 365 and Azure environments. | None |
Removed:
| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Add instrumentation, build golden datasets, write eval-based tests, run them, root-cause failures, and iterate — Ensure your Python LLM application works correctly. Make sure to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM. Use for making sure an LLM application works correctly, catching regressions after prompt changes, fixing unexpected behavior, or validating output quality before shipping. | `references/pixie-api.md` |
Added:
| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/dataset-generation.md`<br />`references/eval-tests.md`<br />`references/instrumentation.md`<br />`references/investigation.md`<br />`references/pixie-api.md`<br />`references/run-harness-patterns.md`<br />`references/understanding-app.md` |
| [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md`<br />`references/excalidraw-schema.md`<br />`scripts/.gitignore`<br />`scripts/README.md`<br />`scripts/add-arrow.py`<br />`scripts/add-icon-to-diagram.py`<br />`scripts/split-excalidraw-library.py`<br />`templates` |
| [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices. | `references/getdata.md`<br />`references/pyspark.md` |
| [fedora-linux-triage](../skills/fedora-linux-triage/SKILL.md) | Triage and resolve Fedora issues with dnf, systemd, and SELinux-aware guidance. | None |

skills/eval-driven-dev/SKILL.md

Lines changed: 217 additions & 701 deletions
Large diffs are not rendered by default.
skills/eval-driven-dev/references/dataset-generation.md

Lines changed: 235 additions & 0 deletions
@@ -0,0 +1,235 @@
# Dataset Generation

This reference covers Step 4 of the eval-driven-dev process: creating the eval dataset.

For full `DatasetStore`, `Evaluable`, and CLI command signatures, see `references/pixie-api.md` (Dataset Python API and CLI Commands sections).

---

## What a dataset contains

A dataset is a collection of `Evaluable` items. Each item has:

- **`eval_input`**: Made-up application input + data from external dependencies. This is what the utility function from Step 3 feeds into the app at test time.
- **`expected_output`**: Case-specific evaluation reference (optional). The meaning depends on the evaluator — it could be an exact answer, a factual reference, or quality criteria text.
- **`eval_output`**: **NOT stored in the dataset.** Produced at test time when the utility function replays the eval_input through the real app.

The dataset is made up by you based on the data shapes observed in the reference trace from Step 2. You are NOT extracting data from traces — you are crafting realistic test scenarios.

---
## Creating the dataset

### CLI

```bash
pixie dataset create <dataset-name>
pixie dataset list   # verify it exists
```

### Python API

```python
from pixie import DatasetStore, Evaluable

store = DatasetStore()
store.create("qa-golden-set", items=[
    Evaluable(
        eval_input={"user_message": "What are your hours?", "customer_profile": {"name": "Alice", "tier": "gold"}},
        expected_output="Response should mention Monday-Friday 9am-5pm and Saturday 10am-2pm",
    ),
    Evaluable(
        eval_input={"user_message": "I need to cancel my order", "customer_profile": {"name": "Bob", "tier": "basic"}},
        expected_output="Should confirm which order and explain the cancellation policy",
    ),
])
```

Or build incrementally:

```python
store = DatasetStore()
store.create("qa-golden-set")
for item in items:
    store.append("qa-golden-set", item)
```

---
## Crafting eval_input items

Each eval_input must match the **exact data shape** from the reference trace. Look at what the `@observe`-decorated function received as input in Step 2 — same field names, same types, same nesting.

### What goes into eval_input

| Data category            | Example                                            | Source                                               |
| ------------------------ | -------------------------------------------------- | ---------------------------------------------------- |
| Application input        | User message, query, request body                  | What a real user would send                          |
| External dependency data | Customer profile, retrieved documents, DB records  | Made up to match the shape from the reference trace  |
| Conversation history     | Previous messages in a chat                        | Made up to set up the scenario                       |
| Configuration / context  | Feature flags, session state                       | Whatever the function expects as arguments           |

### Matching the reference trace shape

From the reference trace (`pixie trace last`), note:

1. **Field names** — use the exact same keys (e.g., `user_message` not `message`, `customer_profile` not `profile`)
2. **Types** — if the trace shows a list, use a list; if it shows a nested dict, use a nested dict
3. **Realistic values** — the data should look like something the app would actually receive. Don't use placeholder text like "test input" or "lorem ipsum"

**Example**: If the reference trace shows the function received:

```json
{
  "user_message": "I'd like to reschedule my appointment",
  "customer_profile": {
    "name": "Jane Smith",
    "account_id": "A12345",
    "tier": "premium"
  },
  "conversation_history": [
    { "role": "assistant", "content": "Welcome! How can I help you today?" }
  ]
}
```

Then every eval_input you make up must have `user_message` (string), `customer_profile` (dict with `name`, `account_id`, `tier`), and `conversation_history` (list of message dicts).
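For instance, a single made-up item matching that shape might look like this (a sketch only; the dataset name reuses the `qa-golden-set` example from above, and all values are invented; only the key structure comes from the reference trace):

```python
from pixie import DatasetStore, Evaluable

store = DatasetStore()
store.append("qa-golden-set", Evaluable(
    # Same keys and nesting as the reference trace: user_message (str),
    # customer_profile (dict with name/account_id/tier), conversation_history (list of dicts).
    eval_input={
        "user_message": "Can you move my appointment to Thursday afternoon?",
        "customer_profile": {"name": "Priya Nair", "account_id": "A20931", "tier": "premium"},
        "conversation_history": [
            {"role": "assistant", "content": "Welcome! How can I help you today?"}
        ],
    },
    expected_output="Should acknowledge the reschedule request and propose or confirm a Thursday afternoon slot",
))
```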
---

## Setting expected_output

`expected_output` is a **reference for evaluation** — its meaning depends on which evaluator will consume it.

### When to set it

| Scenario                                    | expected_output value                                                                   | Evaluator it pairs with                                    |
| ------------------------------------------- | --------------------------------------------------------------------------------------- | ----------------------------------------------------------- |
| Deterministic answer exists                 | The exact answer: `"Paris"`                                                              | `ExactMatchEval`, `FactualityEval`, `ClosedQAEval`           |
| Open-ended but has quality criteria         | Description of good output: `"Should mention Saturday hours and be under 2 sentences"`  | `create_llm_evaluator` with `{expected_output}` in prompt    |
| Truly open-ended, no case-specific criteria | Leave as `"UNSET"` or omit                                                               | Standalone evaluators (`PossibleEval`, `FaithfulnessEval`)   |

### Universal vs. case-specific criteria

- **Universal criteria** apply to ALL test cases → implement in the test function's evaluators (e.g., "responses must be concise", "must not hallucinate"). These don't need expected_output.
- **Case-specific criteria** vary per test case → carry as `expected_output` in the dataset item (e.g., "should mention the caller's Tuesday appointment", "should route to billing").
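A minimal illustration of the split (hypothetical scenarios; universal criteria such as conciseness live in the test function's evaluators, so only the first item carries an expected_output):

```python
from pixie import Evaluable

items = [
    # Case-specific criterion: carried on this item as expected_output.
    Evaluable(
        eval_input={"user_message": "When is my next appointment?",
                    "customer_profile": {"name": "Dana Lee", "account_id": "C300", "tier": "basic"}},
        expected_output="Should mention the caller's Tuesday appointment",
    ),
    # Only universal criteria apply (concise, no hallucination), which the test's
    # evaluators enforce for every item, so expected_output is left unset here.
    Evaluable(
        eval_input={"user_message": "Thanks, that's everything for today.",
                    "customer_profile": {"name": "Omar Haddad", "account_id": "C301", "tier": "basic"}},
    ),
]
```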
### Anti-patterns

- **Don't generate both eval_output and expected_output from the same source.** If they're identical and you use `ExactMatchEval`, the test is circular and catches zero regressions.
- **Don't use comparison evaluators (`FactualityEval`, `ClosedQAEval`, `ExactMatchEval`) on items without expected_output.** They produce meaningless scores.
- **Don't mix expected_output semantics in one dataset.** If some items use expected_output as a factual answer and others as style guidance, evaluators can't handle both. Split into separate datasets or use separate test functions.

---

## Validating the dataset

After creating the dataset, check:

### 1. Structural validation

Every eval_input must match the reference trace's schema (a small helper for this is sketched after the list):

- Same fields present
- Same types (string, int, list, dict)
- Same nesting depth
- No extra or missing fields compared to what the function expects
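One way to automate this check (a sketch; `REQUIRED_KEYS` and `PROFILE_KEYS` are transcribed by hand from the hypothetical reference trace shown earlier, and the dataset name reuses the `qa-golden-set` example):

```python
from pixie import DatasetStore

# Expected shape, copied manually from the Step 2 reference trace (hypothetical).
REQUIRED_KEYS = {"user_message": str, "customer_profile": dict, "conversation_history": list}
PROFILE_KEYS = {"name", "account_id", "tier"}


def check_structure(dataset_name: str = "qa-golden-set") -> None:
    ds = DatasetStore().get(dataset_name)
    for i, item in enumerate(ds.items):
        data = item.eval_input
        assert isinstance(data, dict), f"[{i}] eval_input is not a dict"
        assert set(data) == set(REQUIRED_KEYS), f"[{i}] wrong keys: {sorted(data)}"
        for key, expected_type in REQUIRED_KEYS.items():
            assert isinstance(data[key], expected_type), f"[{i}] {key} should be {expected_type.__name__}"
        assert set(data["customer_profile"]) == PROFILE_KEYS, f"[{i}] profile keys: {sorted(data['customer_profile'])}"
    print(f"{len(ds.items)} items passed structural checks")
```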
### 2. Semantic validation

- **Realistic values** — names, messages, and data look like real-world inputs, not test placeholders
- **Coherent scenarios** — if there's conversation history, it should make topical sense with the user message
- **External dependency data makes sense** — customer profiles have realistic account IDs, retrieved documents are plausible

### 3. Diversity validation

- Items have **meaningfully different** inputs — different user intents, different customer types, different edge cases
- Not just minor variations of the same scenario (e.g., don't have 5 items that are all "What are your hours?" with different names)
- Cover: normal cases, edge cases, things the app might plausibly get wrong

### 4. Expected_output validation

- Case-specific `expected_output` values are specific and testable, not vague
- Items covered only by universal criteria don't redundantly carry an `expected_output`

### 5. Verify by listing

```bash
pixie dataset list
```

Or in the build script:

```python
ds = store.get("qa-golden-set")
print(f"Dataset has {len(ds.items)} items")
for i, item in enumerate(ds.items):
    print(f"  [{i}] input keys: {list(item.eval_input.keys()) if isinstance(item.eval_input, dict) else type(item.eval_input)}")
    print(f"      expected_output: {item.expected_output[:80] if item.expected_output != 'UNSET' else 'UNSET'}...")
```

---
## Recommended build_dataset.py structure

Put the build script at `pixie_qa/scripts/build_dataset.py`:

```python
"""Build the eval dataset with made-up scenarios.

Each eval_input matches the data shape from the reference trace (Step 2).
Run this script to create/recreate the dataset.
"""
from pixie import DatasetStore, Evaluable

DATASET_NAME = "qa-golden-set"


def build() -> None:
    store = DatasetStore()

    # Recreate fresh
    try:
        store.delete(DATASET_NAME)
    except FileNotFoundError:
        pass
    store.create(DATASET_NAME)

    items = [
        # Normal case — straightforward question
        Evaluable(
            eval_input={
                "user_message": "What are your business hours?",
                "customer_profile": {"name": "Alice Johnson", "account_id": "C100", "tier": "gold"},
            },
            expected_output="Should mention Mon-Fri 9am-5pm and Sat 10am-2pm",
        ),
        # Edge case — ambiguous request
        Evaluable(
            eval_input={
                "user_message": "I want to change something",
                "customer_profile": {"name": "Bob Smith", "account_id": "C200", "tier": "basic"},
            },
            expected_output="Should ask for clarification about what to change",
        ),
        # ... more items covering different scenarios
    ]

    for item in items:
        store.append(DATASET_NAME, item)

    # Verify
    ds = store.get(DATASET_NAME)
    print(f"Dataset '{DATASET_NAME}' has {len(ds.items)} items")
    for i, entry in enumerate(ds.items):
        keys = list(entry.eval_input.keys()) if isinstance(entry.eval_input, dict) else type(entry.eval_input)
        print(f"  [{i}] input keys: {keys}")


if __name__ == "__main__":
    build()
```
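Rebuilding the golden set is then a single command (assuming the script lives at the path above and is run from the project root):

```bash
python pixie_qa/scripts/build_dataset.py
pixie dataset list   # confirm the dataset shows up
```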
---

## The cardinal rule

**`eval_output` is always produced at test time, never stored in the dataset.** The dataset contains `eval_input` (made-up input matching the reference trace shape) and optionally `expected_output` (the reference to judge against). The test's `runnable` function produces `eval_output` by replaying `eval_input` through the real app.
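To make that boundary concrete, the runnable is typically just a thin wrapper around your app's entry point (a sketch; `answer_customer` is a stand-in for whatever function your app actually exposes, and the real test wiring lives in `references/eval-tests.md` and `references/run-harness-patterns.md`):

```python
# Hypothetical app entry point; substitute your own.
from my_app.agent import answer_customer


def runnable(eval_input: dict) -> str:
    """Replay one dataset item's eval_input through the real app.

    The return value becomes eval_output at test time; it is never
    written back into the dataset.
    """
    return answer_customer(**eval_input)
```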
