# Dataset Generation

This reference covers Step 4 of the eval-driven-dev process: creating the eval dataset.

For full `DatasetStore`, `Evaluable`, and CLI command signatures, see `references/pixie-api.md` (Dataset Python API and CLI Commands sections).

---

## What a dataset contains

A dataset is a collection of `Evaluable` items. Each item has:

- **`eval_input`**: Made-up application input + data from external dependencies. This is what the utility function from Step 3 feeds into the app at test time.
- **`expected_output`**: Case-specific evaluation reference (optional). The meaning depends on the evaluator — it could be an exact answer, a factual reference, or quality criteria text.
- **`eval_output`**: **NOT stored in the dataset.** Produced at test time when the utility function replays the eval_input through the real app.

The dataset is made up by you based on the data shapes observed in the reference trace from Step 2. You are NOT extracting data from traces — you are crafting realistic test scenarios.

---

## Creating the dataset

### CLI

```bash
pixie dataset create <dataset-name>
pixie dataset list   # verify it exists
```

### Python API

```python
from pixie import DatasetStore, Evaluable

store = DatasetStore()
store.create("qa-golden-set", items=[
    Evaluable(
        eval_input={"user_message": "What are your hours?", "customer_profile": {"name": "Alice", "tier": "gold"}},
        expected_output="Response should mention Monday-Friday 9am-5pm and Saturday 10am-2pm",
    ),
    Evaluable(
        eval_input={"user_message": "I need to cancel my order", "customer_profile": {"name": "Bob", "tier": "basic"}},
        expected_output="Should confirm which order and explain the cancellation policy",
    ),
])
```

Or build incrementally:

```python
store = DatasetStore()
store.create("qa-golden-set")
for item in items:
    store.append("qa-golden-set", item)
```

---

## Crafting eval_input items

Each eval_input must match the **exact data shape** from the reference trace. Look at what the `@observe`-decorated function received as input in Step 2 — same field names, same types, same nesting.

### What goes into eval_input

| Data category | Example | Source |
| --- | --- | --- |
| Application input | User message, query, request body | What a real user would send |
| External dependency data | Customer profile, retrieved documents, DB records | Made up to match the shape from the reference trace |
| Conversation history | Previous messages in a chat | Made up to set up the scenario |
| Configuration / context | Feature flags, session state | Whatever the function expects as arguments |

### Matching the reference trace shape

From the reference trace (`pixie trace last`), note:

1. **Field names** — use the exact same keys (e.g., `user_message` not `message`, `customer_profile` not `profile`)
2. **Types** — if the trace shows a list, use a list; if it shows a nested dict, use a nested dict
3. **Realistic values** — the data should look like something the app would actually receive. Don't use placeholder text like "test input" or "lorem ipsum"

**Example**: If the reference trace shows the function received:

```json
{
  "user_message": "I'd like to reschedule my appointment",
  "customer_profile": {
    "name": "Jane Smith",
    "account_id": "A12345",
    "tier": "premium"
  },
  "conversation_history": [
    { "role": "assistant", "content": "Welcome! How can I help you today?" }
  ]
}
```

Then every eval_input you make up must have `user_message` (string), `customer_profile` (dict with `name`, `account_id`, `tier`), and `conversation_history` (list of message dicts).
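
For instance, a dataset item matching that shape could look like the sketch below. The scenario itself is invented; only the field structure is dictated by the trace.

```python
from pixie import DatasetStore, Evaluable

# One made-up item whose eval_input mirrors the reference trace shape above.
item = Evaluable(
    eval_input={
        "user_message": "Do you have any appointments free next week?",
        "customer_profile": {"name": "Omar Haddad", "account_id": "A67890", "tier": "basic"},
        "conversation_history": [
            {"role": "assistant", "content": "Welcome! How can I help you today?"}
        ],
    },
    expected_output="Should offer to check availability and ask for a preferred day or time",
)

store = DatasetStore()
store.append("qa-golden-set", item)  # dataset name from the examples above
```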

---

## Setting expected_output

`expected_output` is a **reference for evaluation** — its meaning depends on which evaluator will consume it.

### When to set it

| Scenario | expected_output value | Evaluator it pairs with |
| --- | --- | --- |
| Deterministic answer exists | The exact answer: `"Paris"` | `ExactMatchEval`, `FactualityEval`, `ClosedQAEval` |
| Open-ended but has quality criteria | Description of good output: `"Should mention Saturday hours and be under 2 sentences"` | `create_llm_evaluator` with `{expected_output}` in prompt |
| Truly open-ended, no case-specific criteria | Leave as `"UNSET"` or omit | Standalone evaluators (`PossibleEval`, `FaithfulnessEval`) |
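
The three rows correspond to three kinds of dataset items, sketched below with invented scenarios (the eval_input fields are only illustrative):

```python
from pixie import Evaluable

items = [
    # Deterministic answer exists: expected_output is the exact answer
    Evaluable(
        eval_input={"user_message": "What city is your head office in?"},
        expected_output="Paris",
    ),
    # Open-ended with case-specific quality criteria: expected_output describes a good answer
    Evaluable(
        eval_input={"user_message": "Are you open on weekends?"},
        expected_output="Should mention Saturday hours and be under 2 sentences",
    ),
    # Truly open-ended: no case-specific reference, so expected_output is omitted
    # (or left as "UNSET", per the table above)
    Evaluable(
        eval_input={"user_message": "Tell me about your services"},
    ),
]
```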

### Universal vs. case-specific criteria

- **Universal criteria** apply to ALL test cases → implement in the test function's evaluators (e.g., "responses must be concise", "must not hallucinate"). These don't need expected_output.
- **Case-specific criteria** vary per test case → carry as `expected_output` in the dataset item (e.g., "should mention the caller's Tuesday appointment", "should route to billing").

### Anti-patterns

- **Don't generate both eval_output and expected_output from the same source.** If they're identical and you use `ExactMatchEval`, the test is circular and catches zero regressions.
- **Don't use comparison evaluators (`FactualityEval`, `ClosedQAEval`, `ExactMatchEval`) on items without expected_output.** They produce meaningless scores.
- **Don't mix expected_output semantics in one dataset.** If some items use expected_output as a factual answer and others as style guidance, evaluators can't handle both. Split into separate datasets or use separate test functions.

---

## Validating the dataset

After creating the dataset, check:

### 1. Structural validation

Every eval_input must match the reference trace's schema:

- Same fields present
- Same types (string, int, list, dict)
- Same nesting depth
- No extra or missing fields compared to what the function expects
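
A rough sketch of what this check can look like in the build script follows. The helper is hypothetical (not part of Pixie's API), and `reference_input` is assumed to be an example dict you copied by hand from `pixie trace last`.

```python
# Hypothetical structural check: compare each eval_input against a reference
# example copied from the trace. Only keys and top-level types are checked.
def check_shape(eval_input: dict, reference: dict) -> list[str]:
    problems = []
    missing = set(reference) - set(eval_input)
    extra = set(eval_input) - set(reference)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"extra fields: {sorted(extra)}")
    for key in set(reference) & set(eval_input):
        if type(eval_input[key]) is not type(reference[key]):
            problems.append(
                f"{key}: expected {type(reference[key]).__name__}, got {type(eval_input[key]).__name__}"
            )
    return problems

# Usage sketch: ds comes from store.get(...) as in the examples above,
# reference_input is the hand-copied example from the reference trace.
for i, item in enumerate(ds.items):
    for problem in check_shape(item.eval_input, reference_input):
        print(f"[{i}] {problem}")
```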

### 2. Semantic validation

- **Realistic values** — names, messages, and data look like real-world inputs, not test placeholders
- **Coherent scenarios** — if there's conversation history, it should make topical sense with the user message
- **External dependency data makes sense** — customer profiles have realistic account IDs, retrieved documents are plausible

### 3. Diversity validation

- Items have **meaningfully different** inputs — different user intents, different customer types, different edge cases
- Not just minor variations of the same scenario (e.g., don't have 5 items that are all "What are your hours?" with different names)
- Cover: normal cases, edge cases, things the app might plausibly get wrong
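
A quick way to spot near-duplicates is to eyeball the distinct user intents, for example with the sketch below (it assumes eval_input is a dict with a `user_message` key, as in the examples above):

```python
from collections import Counter

# Rough diversity check: count how often each user_message appears.
# Anything with a count above 1 is probably a trivial variation of the same scenario.
messages = Counter(item.eval_input.get("user_message", "") for item in ds.items)
for message, count in messages.most_common():
    print(f"{count}x {message}")
```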

### 4. Expected_output validation

- Case-specific `expected_output` values are specific and testable, not vague
- Items whose criteria are purely universal don't redundantly carry an `expected_output`

### 5. Verify by listing

```bash
pixie dataset list
```

Or in the build script:

```python
ds = store.get("qa-golden-set")
print(f"Dataset has {len(ds.items)} items")
for i, item in enumerate(ds.items):
    keys = list(item.eval_input.keys()) if isinstance(item.eval_input, dict) else type(item.eval_input)
    print(f"  [{i}] input keys: {keys}")
    preview = item.expected_output[:80] if item.expected_output != "UNSET" else "UNSET"
    print(f"      expected_output: {preview}")
```

---

## Recommended build_dataset.py structure

Put the build script at `pixie_qa/scripts/build_dataset.py`:

```python
"""Build the eval dataset with made-up scenarios.

Each eval_input matches the data shape from the reference trace (Step 2).
Run this script to create/recreate the dataset.
"""
from pixie import DatasetStore, Evaluable

DATASET_NAME = "qa-golden-set"

def build() -> None:
    store = DatasetStore()

    # Recreate fresh
    try:
        store.delete(DATASET_NAME)
    except FileNotFoundError:
        pass
    store.create(DATASET_NAME)

    items = [
        # Normal case — straightforward question
        Evaluable(
            eval_input={
                "user_message": "What are your business hours?",
                "customer_profile": {"name": "Alice Johnson", "account_id": "C100", "tier": "gold"},
            },
            expected_output="Should mention Mon-Fri 9am-5pm and Sat 10am-2pm",
        ),
        # Edge case — ambiguous request
        Evaluable(
            eval_input={
                "user_message": "I want to change something",
                "customer_profile": {"name": "Bob Smith", "account_id": "C200", "tier": "basic"},
            },
            expected_output="Should ask for clarification about what to change",
        ),
        # ... more items covering different scenarios
    ]

    for item in items:
        store.append(DATASET_NAME, item)

    # Verify
    ds = store.get(DATASET_NAME)
    print(f"Dataset '{DATASET_NAME}' has {len(ds.items)} items")
    for i, entry in enumerate(ds.items):
        keys = list(entry.eval_input.keys()) if isinstance(entry.eval_input, dict) else type(entry.eval_input)
        print(f"  [{i}] input keys: {keys}")

if __name__ == "__main__":
    build()
```
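
The delete-then-recreate step keeps the script idempotent: rerunning it after editing the items list rebuilds the dataset from scratch instead of appending duplicates. Run it directly (for example, `python pixie_qa/scripts/build_dataset.py`), then confirm the result with `pixie dataset list`.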

---

## The cardinal rule

**`eval_output` is always produced at test time, never stored in the dataset.** The dataset contains `eval_input` (made-up input matching the reference trace shape) and optionally `expected_output` (the reference to judge against). The test's `runnable` function produces `eval_output` by replaying `eval_input` through the real app.