Commit 5f59ddb

update eval-driven-dev skill (#1352)
* update eval-driven-dev skill
* small refinement of skill description
* address review, rerun npm start.
1 parent 88b1920 commit 5f59ddb

19 files changed

Lines changed: 2179 additions & 1707 deletions

docs/README.skills.md

Lines changed: 1 addition & 1 deletion
@@ -131,7 +131,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
 | [ef-core](../skills/ef-core/SKILL.md) | Get best practices for Entity Framework Core | None |
 | [email-drafter](../skills/email-drafter/SKILL.md) | Draft and review professional emails that match your personal writing style. Analyzes your sent emails for tone, greeting, structure, and sign-off patterns via WorkIQ, then generates context-aware drafts for any recipient. USE FOR: draft email, write email, compose email, reply email, follow-up email, analyze email tone, email style. | None |
 | [entra-agent-user](../skills/entra-agent-user/SKILL.md) | Create Agent Users in Microsoft Entra ID from Agent Identities, enabling AI agents to act as digital workers with user identity capabilities in Microsoft 365 and Azure environments. | None |
-| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/dataset-generation.md`<br />`references/eval-tests.md`<br />`references/instrumentation.md`<br />`references/investigation.md`<br />`references/pixie-api.md`<br />`references/run-harness-patterns.md`<br />`references/understanding-app.md` |
+| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/1-a-entry-point.md`<br />`references/1-b-eval-criteria.md`<br />`references/2-wrap-and-trace.md`<br />`references/3-define-evaluators.md`<br />`references/4-build-dataset.md`<br />`references/5-run-tests.md`<br />`references/6-investigate.md`<br />`references/evaluators.md`<br />`references/testing-api.md`<br />`references/wrap-api.md`<br />`resources` |
 | [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md`<br />`references/excalidraw-schema.md`<br />`scripts/.gitignore`<br />`scripts/README.md`<br />`scripts/add-arrow.py`<br />`scripts/add-icon-to-diagram.py`<br />`scripts/split-excalidraw-library.py`<br />`templates` |
 | [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices. | `references/getdata.md`<br />`references/pyspark.md` |
 | [fedora-linux-triage](../skills/fedora-linux-triage/SKILL.md) | Triage and resolve Fedora issues with dnf, systemd, and SELinux-aware guidance. | None |

skills/eval-driven-dev/SKILL.md

Lines changed: 56 additions & 287 deletions
Large diffs are not rendered by default.

skills/eval-driven-dev/references/1-a-entry-point.md
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
# Step 1a: Entry Point & Execution Flow

Identify how the application starts and how a real user invokes it.

---

## What to investigate

### 1. How the software runs

What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?

Look for:

- `if __name__ == "__main__"` blocks
- Framework entry points (FastAPI `app`, Flask `app`, Django `manage.py`)
- CLI entry points in `pyproject.toml` (`[project.scripts]`)
- Docker/compose configs that reveal startup commands
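As a minimal sketch, the kind of entry point this step should surface might look like the following (all names here are hypothetical, not from any particular app):

```python
# Hypothetical minimal CLI entry point: the pattern to look for when tracing
# how the software runs.
import argparse


def answer(query: str) -> str:
    # Placeholder standing in for the real LLM call.
    return f"echo: {query}"


def main(argv=None) -> str:
    parser = argparse.ArgumentParser(prog="myapp")
    parser.add_argument("--query", required=True)
    args = parser.parse_args(argv)
    result = answer(args.query)
    print(result)
    return result


if __name__ == "__main__":
    main()
```

Finding this block tells you both the start command (`python myapp.py --query "..."`) and the required argument surface.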
### 2. The real user entry point

How does a real user or client invoke the app? This is what the eval must exercise — not an inner function that bypasses the request pipeline.

- **Web server**: Which HTTP endpoints accept user input? What methods (GET/POST)? What request body shape?
- **CLI**: What command-line arguments does the user provide?
- **Library/function**: What function does the caller import and call? What arguments?
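The distinction can be sketched as follows (hypothetical names; a stand-in for any request handler, not a real app):

```python
# Sketch: evals must exercise the public entry point, not an inner function
# that bypasses the request pipeline.


def _build_prompt(query: str) -> str:
    # Inner helper: calling this directly skips input validation entirely.
    return f"User asked: {query}"


def handle_request(body: dict) -> dict:
    # Public entry: validates input exactly as a real client experiences it.
    if "query" not in body:
        return {"error": "missing query"}
    return {"answer": _build_prompt(body["query"])}
```

An eval that calls `_build_prompt` directly would never observe the `missing query` error path, so it would not measure what users actually see.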
27+
28+
### 3. Environment and configuration
29+
30+
- What env vars does the app require? (API keys, database URLs, feature flags)
31+
- What config files does it read?
32+
- What has sensible defaults vs. what must be explicitly set?
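One way to make the defaults-vs-required split explicit in code (variable names here are hypothetical examples, not taken from any app):

```python
import os


def required_env(name: str) -> str:
    # Fail fast with a clear message when a must-set variable is missing,
    # rather than surfacing a KeyError deep inside the app.
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"set {name} before starting the app")
    return value


# Has a sensible default: safe to omit when running locally.
DB_URL = os.environ.get("DB_URL", "sqlite:///local.db")
```

Variables read via `os.environ.get(..., default)` go in the "has default" column of your findings; anything read unconditionally must be documented as required.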
---

## Output: `pixie_qa/01-entry-point.md`

Write your findings to this file. Keep it focused — only entry point and execution flow.

### Template

```markdown
# Entry Point & Execution Flow

## How to run

<Command to start the app, required env vars, config files>

## Entry point

- **File**: <e.g., app.py, main.py>
- **Type**: <FastAPI server / CLI / standalone function / etc.>
- **Framework**: <FastAPI, Flask, Django, none>

## User-facing endpoints / interface

<For each way a user interacts with the app:>

- **Endpoint / command**: <e.g., POST /chat, python main.py --query "...">
- **Input format**: <request body shape, CLI args, function params>
- **Output format**: <response shape, stdout format, return type>

## Environment requirements

| Variable | Purpose | Required? | Default |
| -------- | ------- | --------- | ------- |
| ... | ... | ... | ... |
```

skills/eval-driven-dev/references/1-b-eval-criteria.md
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
# Step 1b: Eval Criteria

Define what quality dimensions matter for this app — based on the entry point (`01-entry-point.md`) you've already documented.

This document serves two purposes:

1. **Dataset creation (Step 4)**: The use cases tell you what kinds of items to generate — each use case should have representative items in the dataset.
2. **Evaluator selection (Step 3)**: The eval criteria tell you what evaluators to choose and how to map them.

Keep this concise — it's a planning artifact, not a comprehensive spec.

---

## What to define

### 1. Use cases

List the distinct scenarios the app handles. Each use case becomes a category of dataset items. **Each use case description must be a concise one-liner that conveys both (a) what the input is and (b) what the expected behavior or outcome is.** The description should be specific enough that someone unfamiliar with the app can understand the scenario and its success criteria.

**Good use case descriptions:**

- "Reroute to human agent on account lookup difficulties"
- "Answer billing question using customer's plan details from CRM"
- "Decline to answer questions outside the support domain"
- "Summarize research findings including all queried sub-topics"

**Bad use case descriptions (too vague):**

- "Handle billing questions"
- "Edge case"
- "Error handling"

### 2. Eval criteria

Define **high-level, application-specific eval criteria** — quality dimensions that matter for THIS app. Each criterion will map to an evaluator in Step 3.

**Good criteria are specific to the app's purpose.** Examples:

- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation?"
- Research report generator: "Does the report address all sub-questions?", "Are claims supported by retrieved sources?"
- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when context is missing?"

**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.

At this stage, don't pick evaluator classes or thresholds. That comes in Step 3.

### 3. Check criteria applicability and observability

For each criterion:

1. **Determine applicability scope** — does this criterion apply to ALL use cases, or only a subset? If a criterion is only relevant for certain scenarios (e.g., "identity verification" only applies to account-related requests, not general FAQ), mark it clearly. This distinction is critical for Step 4 (dataset creation) because:
   - **Universal criteria** → become dataset-level default evaluators
   - **Case-specific criteria** → become item-level evaluators on relevant rows only

2. **Verify observability** — for each criterion, identify what data point in the app needs to be captured as a `wrap()` call to evaluate it. This drives the wrap coverage in Step 2.
   - If the criterion is about the app's final response → captured by `wrap(purpose="output", name="response")`
   - If it's about a routing decision → captured by `wrap(purpose="state", name="routing_decision")`
   - If it's about data the app fetched and used → captured by `wrap(purpose="input", name="...")`
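The capture mapping can be sketched with a stub. Note this is illustration only: the real `wrap()` comes from the Pixie API and traces values for evaluators, whereas this stand-in merely records which data points get captured:

```python
# Stand-in for the Pixie wrap() API -- a sketch, not the real tracer.
captured = {}


def wrap(value, *, purpose: str, name: str):
    # The real API would emit a trace; here we just record the data point
    # so the criterion-to-capture mapping is visible.
    captured[(purpose, name)] = value
    return value


# Final response -> checked by universal (dataset-level) criteria.
response = wrap("Your plan renews on March 1.", purpose="output", name="response")

# Routing decision -> checked only by case-specific (item-level) criteria.
route = wrap("human_agent", purpose="state", name="routing_decision")
```

If a criterion has no corresponding `wrap()` capture, it cannot be evaluated, which is exactly the gap this check is meant to expose before Step 2.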
---

## Output: `pixie_qa/02-eval-criteria.md`

Write your findings to this file. **Keep it short** — the template below is the maximum length.

### Template

```markdown
# Eval Criteria

## Use cases

1. <Use case name>: <one-liner conveying input + expected behavior>
2. ...

## Eval criteria

| # | Criterion | Applies to | Data to capture |
| --- | --------- | ------------- | --------------- |
| 1 | ... | All | wrap name: ... |
| 2 | ... | Use case 1, 3 | wrap name: ... |
```
