# Step 1b: Eval Criteria

Define what quality dimensions matter for this app — based on the entry point (`01-entry-point.md`) you've already documented.

This document serves two purposes:

1. **Dataset creation (Step 4)**: The use cases tell you what kinds of items to generate — each use case should have representative items in the dataset.
2. **Evaluator selection (Step 3)**: The eval criteria tell you which evaluators to choose and how to map them.

Keep this concise — it's a planning artifact, not a comprehensive spec.

---

## What to define

### 1. Use cases

List the distinct scenarios the app handles. Each use case becomes a category of dataset items. **Each use case description must be a concise one-liner that conveys both (a) what the input is and (b) what the expected behavior or outcome is.** The description should be specific enough that someone unfamiliar with the app can understand the scenario and its success criteria.

**Good use case descriptions:**

- "Reroute to human agent on account lookup difficulties"
- "Answer billing question using customer's plan details from CRM"
- "Decline to answer questions outside the support domain"
- "Summarize research findings including all queried sub-topics"

**Bad use case descriptions (too vague):**

- "Handle billing questions"
- "Edge case"
- "Error handling"

### 2. Eval criteria

Define **high-level, application-specific eval criteria** — quality dimensions that matter for THIS app. Each criterion will map to an evaluator in Step 3.

**Good criteria are specific to the app's purpose.** Examples:

- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation?"
- Research report generator: "Does the report address all sub-questions?", "Are claims supported by retrieved sources?"
- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when context is missing?"

**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.

At this stage, don't pick evaluator classes or thresholds. That comes in Step 3.

### 3. Check criteria applicability and observability

For each criterion:

1. **Determine applicability scope** — does this criterion apply to ALL use cases, or only a subset? If a criterion is only relevant for certain scenarios (e.g., "identity verification" only applies to account-related requests, not general FAQ), mark it clearly. This distinction is critical for Step 4 (dataset creation) because:
   - **Universal criteria** → become dataset-level default evaluators
   - **Case-specific criteria** → become item-level evaluators on relevant rows only

2. **Verify observability** — for each criterion, identify what data point in the app needs to be captured as a `wrap()` call to evaluate it. This drives the wrap coverage in Step 2.
   - If the criterion is about the app's final response → captured by `wrap(purpose="output", name="response")`
   - If it's about a routing decision → captured by `wrap(purpose="state", name="routing_decision")`
   - If it's about data the app fetched and used → captured by `wrap(purpose="input", name="...")`
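
The applicability-and-observability check above can be sketched as a small planning structure in code. This is a minimal, hypothetical illustration — the criterion names, use-case names, and wrap names are invented for the example, not taken from a real app; only the `purpose`/`name` pairing mirrors the `wrap()` calls referenced in this guide:

```python
# Hypothetical planning table: each eval criterion is paired with its
# applicability scope ("all" or a list of use cases) and the wrap() data
# point that makes it observable. All names below are illustrative.
CRITERIA = [
    # (criterion, applies_to, wrap_purpose, wrap_name)
    ("Response is concise enough for phone", "all", "output", "response"),
    ("Escalates failed lookups to a human", ["account_escalation"], "state", "routing_decision"),
    ("Billing answer grounded in CRM data", ["billing_lookup"], "input", "crm_record"),
]

def dataset_level_criteria(criteria):
    """Universal criteria -> dataset-level default evaluators."""
    return [c for c in criteria if c[1] == "all"]

def item_level_criteria(criteria):
    """Case-specific criteria -> item-level evaluators on relevant rows only."""
    return [c for c in criteria if c[1] != "all"]
```

Splitting the list this way makes the Step 4 mapping mechanical: the first group becomes dataset defaults, the second is attached row by row.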

---

## Output: `pixie_qa/02-eval-criteria.md`

Write your findings to this file. **Keep it short** — the template below is the maximum length.

### Template

```markdown
# Eval Criteria

## Use cases

1. <Use case name>: <one-liner conveying input + expected behavior>
2. ...

## Eval criteria

| # | Criterion | Applies to | Data to capture |
| --- | --------- | ------------- | --------------- |
| 1 | ... | All | wrap name: ... |
| 2 | ... | Use case 1, 3 | wrap name: ... |
```
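
For concreteness, here is one way the filled-in file might look. Everything below is illustrative — the app, use cases, and wrap names are hypothetical, not drawn from a real project:

```markdown
# Eval Criteria

## Use cases

1. Billing lookup: Customer asks about a charge; app answers using plan details fetched from the CRM.
2. Account escalation: Account lookup fails; app reroutes the conversation to a human agent.
3. Off-topic request: Customer asks something outside the support domain; app politely declines.

## Eval criteria

| # | Criterion | Applies to | Data to capture |
| --- | --------- | ------------- | --------------- |
| 1 | Response is concise enough for a phone conversation | All | wrap name: response |
| 2 | Escalation happens after a failed lookup, not before | Use case 2 | wrap name: routing_decision |
| 3 | Billing answers are grounded in the fetched CRM record | Use case 1 | wrap name: crm_record |
```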