Commit 3b0e1d8

spboyer and Copilot committed
Sync sensei skill with upstream microsoft/GitHub-Copilot-for-Azure
Updates sensei skill from v1.0.0 to v1.0.5 and adds missing GEPA integration. Changes synced from microsoft/GitHub-Copilot-for-Azure.

Updated files:
- SKILL.md: v1.0.5 with help banner, batch mode, overlap disambiguation
- SCORING.md: routing-regression guard, trigger-overlap rules
- README.md, LOOP.md, EXAMPLES.md: testPathPatterns fix, link updates

New files:
- references/TOKEN-INTEGRATION.md: separate token integration docs
- scripts/gepa/auto_evaluator.py: GEPA quality scoring evaluator
- .github/workflows/gepa-quality-score.yml: CI quality scoring
- .github/workflows/gepa-quality-score-comment.yml: PR score comments

Path references updated from plugin/skills/ to .github/skills/ for this repository's layout.

Fixes #15096

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 762db46 commit 3b0e1d8

9 files changed: +1154 -186 lines changed

.github/skills/sensei/README.md

Lines changed: 9 additions & 9 deletions
@@ -22,7 +22,7 @@ Sensei automates the improvement of [Agent Skills](https://agentskills.io) front
 
 ### The Problem
 
-The [frontmatter audit](https://gist.github.com/spboyer/28c31bf0cafb87489406832633aa31a7) revealed that all SDK skills have:
+The [frontmatter audit](https://gist.github.com/spboyer/28c31bf0cafb87489406832633aa31a7) revealed that all Azure skills have:
 - **0% High adherence** - No skills have triggers + anti-triggers + compatibility
 - **46% Low adherence** - 12 skills have minimal descriptions without clear triggers
 - **0/26 anti-triggers** - No skills tell agents when NOT to use them
@@ -108,7 +108,7 @@ cd tests
 npm install
 
 # Verify tests run
-npm test -- --testPathPattern=azure-validation
+npm test -- --testPathPatterns=azure-validation
 ```
 
 ---
@@ -164,7 +164,7 @@ npm test -- --testPathPattern=azure-validation
 └─────────────────────┬───────────────────────────────────┘
 
 ┌─────────────────────────────────────────────────────────┐
-│ 6. VERIFY: npm test -- --testPathPattern={skill-name} │
+│ 6. VERIFY: npm test -- --testPathPatterns={skill-name} │
 │ • If tests fail → fix and retry │
 │ • If tests pass → continue │
 └─────────────────────┬───────────────────────────────────┘
@@ -283,7 +283,7 @@ To reach Medium-High, a skill must have:
 
 ### Token Budget
 
-From **skill-authoring**:
+From [skill-authoring](/.github/skills/skill-authoring):
 - **SKILL.md:** < 500 tokens (soft), < 5000 (hard)
 - **references/*.md:** < 1000 tokens each
 - Check with: `cd scripts && npm run tokens -- check plugin/skills/{skill}/SKILL.md`
@@ -297,7 +297,7 @@ From **skill-authoring**:
 ```yaml
 ---
 name: appinsights-instrumentation
-description: 'Implement retry logic for HTTP client requests with exponential backoff'
+description: 'Instrument a webapp to send useful telemetry data to Azure App Insights'
 ---
 ```
 
@@ -313,7 +313,7 @@ description: 'Implement retry logic for HTTP client requests with exponential ba
 ---
 name: appinsights-instrumentation
 description: >-
-  Implement retry logic with exponential backoff for HTTP requests.
+  Instrument web apps to send telemetry to Azure Application Insights.
   USE FOR: "add App Insights", "instrument my app", "set up monitoring",
   "add telemetry", "track requests", "ASP.NET Core telemetry", "Node.js monitoring".
   DO NOT USE FOR: querying logs (use azure-observability), creating alerts,
@@ -368,7 +368,7 @@ const shouldNotTriggerPrompts = [
 3. Run tests manually to see specific failures:
 ```bash
 cd tests
-npm test -- --testPathPattern={skill-name} --verbose
+npm test -- --testPathPatterns={skill-name} --verbose
 ```
 
 ### Skill Not Reaching Target Score
@@ -441,5 +441,5 @@ If Sensei produces unexpected results:
 
 ### Related Skills
 
-- **markdown-token-optimizer** - Token analysis and optimization suggestions
-- **skill-authoring** - Guidelines for writing compliant Agent Skills
+- [markdown-token-optimizer](/.github/skills/markdown-token-optimizer) - Token analysis and optimization suggestions
+- [skill-authoring](/.github/skills/skill-authoring) - Guidelines for writing compliant Agent Skills
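
The revised README example packs routing hints into the description as labeled lists (`USE FOR:`, `DO NOT USE FOR:`), and the test files mirror them in `shouldTriggerPrompts` and `shouldNotTriggerPrompts`. As a rough illustration of how those labels could be pulled apart when drafting test prompts, here is a minimal Python sketch. The function name `split_description` and the label set are assumptions made for this example; they are not part of sensei or its scripts.

```python
import re

# Labels seen in the descriptions touched by this commit. Treating them as
# section markers is an assumption of this sketch, not a documented format.
# The longer label is listed first so "DO NOT USE FOR" is not read as "USE FOR".
LABEL_RE = re.compile(r"(DO NOT USE FOR|USE FOR|WHEN|INVOKES):")


def split_description(description: str) -> dict[str, list[str]]:
    """Map each label to the double-quoted phrases that follow it."""
    sections: dict[str, list[str]] = {}
    matches = list(LABEL_RE.finditer(description))
    for i, match in enumerate(matches):
        # A section runs from the end of its label to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(description)
        segment = description[match.end():end]
        sections[match.group(1)] = re.findall(r'"([^"]+)"', segment)
    return sections


if __name__ == "__main__":
    example = (
        'Instrument web apps to send telemetry to Azure Application Insights. '
        'USE FOR: "add App Insights", "instrument my app". '
        'DO NOT USE FOR: querying logs (use azure-observability), creating alerts.'
    )
    print(split_description(example))
```

Only quoted phrases are collected, so an anti-trigger list written as plain prose (as in the README example) yields an empty list and would still need hand-written negative prompts.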

.github/skills/sensei/SKILL.md

Lines changed: 200 additions & 34 deletions
@@ -1,64 +1,230 @@
 ---
 name: sensei
-description: "Improve skill frontmatter compliance iteratively using the Ralph loop pattern. **WORKFLOW SKILL**. WHEN: \"run sensei\", \"sensei help\", \"improve skill\", \"fix frontmatter\", \"skill compliance\", \"frontmatter audit\", \"score skill\", \"check skill tokens\". DO NOT USE FOR: writing new skills from scratch (use skill-authoring), general code review. INVOKES: waza CLI, git."
+description: "**WORKFLOW SKILL** — Iteratively improve skill frontmatter compliance using the Ralph loop pattern. WHEN: \"run sensei\", \"sensei help\", \"improve skill\", \"fix frontmatter\", \"skill compliance\", \"frontmatter audit\", \"score skill\", \"check skill tokens\". INVOKES: token counting tools, test runners, git commands. FOR SINGLE OPERATIONS: use token CLI directly for counts/checks."
 license: MIT
 metadata:
   author: Microsoft
-  version: "1.0.0"
-compatibility:
-  platforms: "copilot-chat"
+  version: "1.0.5"
 ---
 
 # Sensei
 
-Iteratively improve skills until they pass waza check compliance.
+> "A true master teaches not by telling, but by refining." - The Skill Sensei
 
-## Usage
+Automates skill frontmatter improvement using the [Ralph loop pattern](https://github.com/soderlind/ralph) - iteratively improving skills until they reach Medium-High compliance with passing tests, then checking token usage and prompting for action.
 
+## Help
+
+When user says "sensei help" or asks how to use sensei, show this:
+
+```
+╔══════════════════════════════════════════════════════════════════╗
+║ SENSEI - Skill Frontmatter Compliance Improver ║
+╠══════════════════════════════════════════════════════════════════╣
+║ ║
+║ USAGE: ║
+║ Run sensei on <skill-name> # Single skill ║
+║ Run sensei on <skill-name> --skip-integration # Fast mode ║
+║ Run sensei on <skill1>, <skill2>, ... # Multiple skills ║
+║ Run sensei on all Low-adherence skills # Batch by score ║
+║ Run sensei on all skills # All skills ║
+║ ║
+║ EXAMPLES: ║
+║ Run sensei on appinsights-instrumentation ║
+║ Run sensei on azure-security --skip-integration ║
+║ Run sensei on azure-security, azure-observability ║
+║ Run sensei on all Low-adherence skills ║
+║ ║
+║ WHAT IT DOES: ║
+║ 1. READ - Load skill's SKILL.md, tests, and token count ║
+║ 2. SCORE - Check compliance (Low/Medium/Medium-High/High) ║
+║ 3. SCAFFOLD - Create tests from template if missing ║
+║ 4. IMPROVE - Add WHEN: triggers (cross-model optimized) ║
+║ 5. TEST - Run tests, fix if needed ║
+║ 6. REFERENCES- Validate markdown links ║
+║ 7. TOKENS - Check token budget, gather suggestions ║
+║ 8. SUMMARY - Show before/after with suggestions ║
+║ 9. PROMPT - Ask: Commit, Create Issue, or Skip? ║
+║ 10. REPEAT - Until Medium-High score + tests pass ║
+║ ║
+║ TARGET SCORE: Medium-High ║
+║ ✓ Description > 150 chars, ≤ 60 words ║
+║ ✓ Has "WHEN:" trigger phrases (preferred) ║
+║ ✓ No "DO NOT USE FOR:" (unless disambiguation-critical) ║
+║ ✓ SKILL.md < 500 tokens (soft limit) ║
+║ ║
+║ MORE INFO: ║
+║ See .github/skills/sensei/README.md for full documentation ║
+║ ║
+╚══════════════════════════════════════════════════════════════════╝
+```
+
+## When to Use
+
+- Improving a skill's frontmatter compliance score
+- Adding trigger phrases and anti-triggers to skill descriptions
+- Batch-improving multiple skills at once
+- Auditing and fixing Low-adherence skills
+
+## Invocation Modes
+
+### Single Skill
+```
+Run sensei on azure-deploy
+```
+
+### Multiple Skills
+```
+Run sensei on azure-security, azure-observability
+```
+
+### By Adherence Level
+```
+Run sensei on all Low-adherence skills
+```
+
+### All Skills
 ```
-Run sensei on <skill-name>
-Run sensei on <skill1>, <skill2>
 Run sensei on all skills
 ```
 
+### GEPA Mode (Deep Optimization)
+```
+Run sensei on my-skill --gepa
+Run sensei on my-skill --gepa --skip-integration
+Run sensei on all skills --gepa
+```
+
+When `--gepa` is used, Step 5 (IMPROVE) is replaced with GEPA evolutionary optimization.
+Instead of template-based improvements, GEPA parses trigger prompt arrays from the existing
+test harness and combines them with content quality heuristics to build a fitness function.
+An LLM proposes and evaluates many candidate improvements automatically. Note: GEPA does not
+execute Jest tests directly — it uses the test data (prompts) as evaluation inputs.
+
+**GEPA score-only mode** (no LLM calls, just evaluate current quality):
+```
+Run sensei score my-skill
+Run sensei score all skills
+```
+
 ## The Ralph Loop
 
-For each skill, repeat until compliant (max 5 iterations):
+For each skill, execute this loop until score >= Medium-High AND tests pass:
+
+1. **READ** - Load `plugin/skills/{skill-name}/SKILL.md`, tests, and token count
+2. **SCORE** - Run spec-based compliance check (see [SCORING.md](references/SCORING.md)):
+   - Validate `name` per [agentskills.io spec](https://agentskills.io/specification) (no `--`, no start/end `-`, lowercase alphanumeric)
+   - Check description length and word count (≤60 words)
+   - Check triggers (WHEN: preferred, USE FOR: accepted)
+   - Warn on "DO NOT USE FOR:" (risky in multi-skill environments — **exception**: REQUIRED for skills that share trigger overlap with broader skills like `azure-prepare`)
+   - Preserve optional spec fields (`license`, `metadata`, `allowed-tools`) if present
+3. **CHECK** - If score >= Medium-High AND tests pass → go to TOKENS step
+4. **SCAFFOLD** - If `tests/{skill-name}/` doesn't exist, create from `tests/_template/`
+5. **IMPROVE FRONTMATTER** - Add WHEN: triggers (stay under 60 words and 1024 chars)
+5b. **IMPROVE WITH GEPA** (when `--gepa` flag is set) — Replaces step 5 (IMPROVE FRONTMATTER) with automated optimization; step 6 (IMPROVE TESTS) still runs normally:
+   - Auto-discovers `tests/{skill-name}/triggers.test.ts` and extracts prompt arrays
+   - Builds a GEPA evaluator scoring content quality + trigger accuracy based on those trigger prompt arrays (not Jest test pass/fail results)
+   - Runs `python .github/skills/sensei/scripts/gepa/auto_evaluator.py optimize --skill {skill-name} --skills-dir plugin/skills --tests-dir tests`
+   - Shows diff of optimized SKILL.md for user approval
+   - GEPA uses existing test trigger definitions as configuration — it does not execute, replace, or modify Jest tests
+6. **IMPROVE TESTS** - Update `shouldTriggerPrompts` and `shouldNotTriggerPrompts` to match the finalized frontmatter (including any GEPA changes)
+7. **VERIFY** - Run `cd tests && npm test -- --testPathPatterns={skill-name}`
+8. **VALIDATE REFERENCES** - Run `cd scripts && npm run references {skill-name}` to check markdown links
+9. **TOKENS** - Check token budget and line count (< 500 lines per spec), gather optimization suggestions
+10. **SUMMARY** - Display before/after comparison with unimplemented suggestions
+11. **PROMPT** - Ask user: Commit, Create Issue, or Skip?
+12. **REPEAT** - Go to step 2 (max 5 iterations per skill)
+
+## Scoring Criteria (Quick Reference)
+
+Sensei validates skills against the [agentskills.io specification](https://agentskills.io/specification). See [SCORING.md](references/SCORING.md) for full details.
+
+| Score | Requirements |
+|-------|--------------|
+| **Invalid** | Name fails spec validation (consecutive hyphens, start/end hyphen, uppercase, etc.) |
+| **Low** | Basic description, no explicit triggers |
+| **Medium** | Has trigger keywords/phrases, description > 150 chars, >60 words |
+| **Medium-High** | Has "WHEN:" (preferred) or "USE FOR:" triggers, ≤60 words |
+| **High** | Medium-High + compatibility field |
+
+**Target: Medium-High** (distinctive triggers, concise description)
+
+> ⚠️ "DO NOT USE FOR:" is **risky in multi-skill environments** (15+ overlapping skills) — causes keyword contamination on fast-pattern-matching models. Safe for small, isolated skill sets. Use positive routing with `WHEN:` for cross-model safety.
+>
+> **Exception — disambiguation-critical skills:** When a skill's `USE FOR` triggers directly overlap with a broader skill (e.g., `azure-prepare` owns "deploy to Azure"), `DO NOT USE FOR:` is **REQUIRED** to prevent the broader skill from capturing prompts that belong to the specialized skill. Removing it causes routing regressions. Integration tests validate this routing -- run them before removing any `DO NOT USE FOR:` clause.
+
+**Strongly recommended** (reported as suggestions if missing):
+- `license` — identifies the license applied to the skill
+- `metadata.version` — tracks the skill version for consumers
+
+## Frontmatter Template
+
+Per the [agentskills.io spec](https://agentskills.io/specification), required and optional fields:
+
+```yaml
+---
+name: skill-name
+description: "[ACTION VERB] [UNIQUE_DOMAIN]. [One clarifying sentence]. WHEN: \"trigger 1\", \"trigger 2\", \"trigger 3\"."
+license: MIT
+metadata:
+  version: "1.0"
+# Other optional spec fields — preserve if already present:
+# metadata.author: example-org
+# allowed-tools: Bash(git:*) Read
+---
+```
+
+> **IMPORTANT:** Use inline double-quoted strings for descriptions. Do NOT use `>-` folded scalars (incompatible with skills.sh). Do NOT use `|` literal blocks (preserves newlines). Keep total description under 1024 characters and ≤60 words.
+
+> ⚠️ **"DO NOT USE FOR:" carries context-dependent risk.** In multi-skill environments (10+ skills with overlapping domains), anti-trigger clauses introduce the very keywords that cause wrong-skill activation on Claude Sonnet and fast-pattern-matching models ([evidence](https://gist.github.com/kvenkatrajan/52e6e77f5560ca30640490b4cc65d109)). For small, isolated skill sets (1-5 skills), the risk is low. When in doubt, use positive routing with `WHEN:` and distinctive quoted phrases.
+>
+> **Exception:** `DO NOT USE FOR:` is **REQUIRED** when a specialized skill's triggers overlap with a broader skill (e.g., `azure-hosted-copilot-sdk` vs. `azure-prepare` on "deploy to Azure"). Without the negative discriminator, the broader skill captures prompts that should route to the specialized one. Always run integration tests before removing a `DO NOT USE FOR:` clause.
 
-1. **READ** — Load SKILL.md and current token count
-2. **SCORE** — Run `waza check {skill-name}` for compliance
-3. **FIX** — Address issues: tokens, broken links, frontmatter
-4. **VERIFY** — Re-run `waza check`; loop if issues remain
-5. **COMMIT**`sensei: improve {skill-name} frontmatter`
+## Test Scaffolding
 
-Target: High compliance, ≤500 tokens, all links valid.
+When tests don't exist, scaffold from `tests/_template/`:
 
-## Frontmatter Rules
+```bash
+cp -r tests/_template tests/{skill-name}
+```
+
+Then update:
+1. `SKILL_NAME` constant in all test files
+2. `shouldTriggerPrompts` - 5+ prompts matching new frontmatter triggers
+3. `shouldNotTriggerPrompts` - 5+ prompts matching anti-triggers
 
-- Use inline double-quoted `description` (not `>-` folded scalars)
-- Lead with action verb + domain; add `WHEN:` trigger phrases
-- Keep ≤60 words, ≤1024 chars
+**Commit Messages:**
+```
+sensei: improve {skill-name} frontmatter
+```
 
-## Tools
+## Constraints
 
-No MCP servers required — uses waza CLI directly.
+- Only modify `plugin/skills/` - these are the Azure skills used by Copilot
+- `.github/skills/` contains meta-skills like sensei for developer tooling
+- Max 5 iterations per skill before moving on
+- Description must stay under 1024 characters
+- SKILL.md should stay under 500 tokens (soft limit)
+- Tests must pass before prompting for action
+- User chooses: Commit, Create Issue, or Skip after each skill
 
-| Tool | Fallback |
-|------|----------|
-| waza | `waza check` / `waza run` CLI |
-| git | Standard git CLI |
+## Flags
 
-## Examples
+| Flag | Description |
+|------|-------------|
+| `--skip-integration` | Skip integration tests for faster iteration. Only runs unit and trigger tests. |
+| `--gepa` | Use GEPA evolutionary optimization instead of template-based improvement. Auto-discovers tests and builds evaluator at runtime. |
 
-- "Run sensei on pipeline-troubleshooting"
-- "Run sensei on all skills"
+> ⚠️ Skipping integration tests speeds up the loop but may miss runtime issues. Consider running full tests before final commit.
 
-## Troubleshooting
+## Reference Documentation
 
-If waza fails, verify it is installed and you are in the skills directory.
+- [SCORING.md](references/SCORING.md) - Detailed scoring criteria
+- [LOOP.md](references/LOOP.md) - Ralph loop workflow details
+- [EXAMPLES.md](references/EXAMPLES.md) - Before/after examples
+- [TOKEN-INTEGRATION.md](references/TOKEN-INTEGRATION.md) - Token budget integration
 
-## References
+## Related Skills
 
-- [SCORING.md](references/SCORING.md) — Scoring criteria and token budgets
-- [LOOP.md](references/LOOP.md) — Detailed workflow
-- [EXAMPLES.md](references/EXAMPLES.md) — Before/after examples
+- [markdown-token-optimizer](/.github/skills/markdown-token-optimizer) - Token analysis and optimization
+- [skill-authoring](/.github/skills/skill-authoring) - Skill writing guidelines
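
The scoring table added to SKILL.md can be read as a small decision procedure. The following is a minimal Python sketch of that reading, not the logic in `auto_evaluator.py` or SCORING.md, neither of which is shown in this diff. The names `is_valid_name` and `score_skill` are hypothetical, and requiring more than 150 characters for anything above Low is an assumption borrowed from the help banner's target list.

```python
import re

# Lowercase alphanumeric segments joined by single hyphens. This rejects
# uppercase letters, a leading or trailing "-", and consecutive hyphens.
NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

# "USE FOR:" counts as a trigger label only when it is not the tail of
# "DO NOT USE FOR:" (fixed-width negative lookbehind).
TRIGGER_RE = re.compile(r"WHEN:|(?<!DO NOT )USE FOR:")

MAX_WORDS = 60   # word limit from the scoring table
MIN_CHARS = 150  # assumption: taken from the help banner's target list


def is_valid_name(name: str) -> bool:
    """Check the naming rules listed under the SCORE step."""
    return NAME_RE.fullmatch(name) is not None


def score_skill(name: str, description: str, has_compatibility: bool = False) -> str:
    """Return one of: Invalid, Low, Medium, Medium-High, High."""
    if not is_valid_name(name):
        return "Invalid"
    has_triggers = TRIGGER_RE.search(description) is not None
    if not has_triggers or len(description) <= MIN_CHARS:
        return "Low"
    if len(description.split()) > MAX_WORDS:
        return "Medium"
    return "High" if has_compatibility else "Medium-High"


if __name__ == "__main__":
    description = (
        'Instrument web apps to send telemetry to Azure Application Insights. '
        'WHEN: "add App Insights", "instrument my app", "set up monitoring", '
        '"add telemetry", "track requests".'
    )
    print(score_skill("appinsights-instrumentation", description))
```

In the loop described above, a check like this would run at the SCORE step and again after each improvement, with the loop stopping once the result is Medium-High or better and the tests pass, or after five iterations.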
