test(uipath-planner): add Lane A task-derivation + single-project deferral coverage by RaduAna-Maria · Pull Request #1641 · UiPath/skills

RaduAna-Maria · 2026-06-23T11:31:29Z

What

Adds two coverage tests for uipath-planner. Every existing planner test exercises Phase D / SDD generation only — these fill two untested gaps surfaced while triaging the SDD timeout work (PR #1636).

1. Lane A — PDD-driven task derivation (`integration`)

tests/tasks/uipath-planner/lane_a_task_derivation/

Lane A (read an SDD with a ## Planner Handoff marker → derive the task list → route to specialists) had zero coverage. The test stages a finished, non-UI RPA SDD fixture (vendor-payment-sync-sdd.md, with a populated handoff header) into the sandbox and asserts the skill:

detects the marker and routes to Lane A,
writes vendor-payment-sync-tasks.md (the file named in the handoff header),
routes work to the correct specialists (uipath-rpa build + uipath-platform for the queue/asset/Orchestrator deploy),
follows the plan-and-tasks task-row schema (Task T1, Identity:, Status:, Skill prompt, Blocked by:),
includes a mandatory Testing task and the anti-hallucination rule,
does not re-author the SDD or start building (no .xaml authored).

It's cheap by design — task derivation from a provided SDD, no full SDD authoring — and caps max_thinking_tokens to keep the turn short (the thinking-bound-turn lever from the PR #1636 investigation).
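The Lane A assertions listed above can be approximated in plain Python. This is a minimal sketch for readers, not the actual coder-eval criteria: the function name `check_lane_a_output` and the way failures are reported are hypothetical, while the marker strings and the tasks file name are the ones quoted in this description.

```python
from pathlib import Path

# Marker strings quoted in this PR description; the checking logic is illustrative.
REQUIRED_MARKERS = ["Task T1", "Identity:", "Status:", "Skill prompt", "Blocked by:"]
REQUIRED_SPECIALISTS = ["uipath-rpa", "uipath-platform"]


def check_lane_a_output(sandbox: Path, tasks_name: str) -> list[str]:
    """Return human-readable failures; an empty list means every check passed."""
    tasks_file = sandbox / tasks_name
    if not tasks_file.is_file():
        return [f"missing {tasks_name}"]

    text = tasks_file.read_text(encoding="utf-8")
    failures = [
        f"missing marker: {marker}"
        for marker in REQUIRED_MARKERS + REQUIRED_SPECIALISTS
        if marker not in text
    ]

    # Lane A must stop at task derivation: no .xaml anywhere in the sandbox.
    if any(True for _ in sandbox.rglob("*.xaml")):
        failures.append("unexpected .xaml authored")
    return failures
```

Calling `check_lane_a_output(Path("."), "vendor-payment-sync-tasks.md")` from the sandbox root would report which of the assertions above are not yet satisfied.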

2. Single-project deferral (`smoke`)

tests/tasks/uipath-planner/single_project_deferral.yaml

A self-contained Maestro Flow (decision + HTTP + inline HITL nodes) must route to one specialist (uipath-maestro-flow), not fan out through the multi-project planner. Inline nodes are author sub-steps, not separate buildable projects. Guards the SKILL.md "skip for single-project" rule — explicitly called out there as the most common mis-trigger. Graded as a one-line routing decision plus a check that no planner plan/tasks/SDD artifact was produced.

Validation

Both pass coder-eval plan (schema + agent-config resolution).
Both carry the full required tag set: skill + tier + mode:* + lifecycle:*.
Behavioral runs pending via the Run Coder Eval workflow (workflow_dispatch, task_globs: tasks/uipath-planner/lane_a_task_derivation/lane_a_task_derivation.yaml tasks/uipath-planner/single_project_deferral.yaml). The auto smoke check will run the deferral (smoke-tagged) task automatically.

Independent of #1636 (branched from main); no overlap with the timeout/filename fixes.

🤖 Generated with Claude Code

…erral coverage

Two gaps the existing planner suite missed; every prior planner test exercised Phase D / SDD generation only:

1. Lane A (PDD-driven task derivation), integration. Stages a finished SDD with a `## Planner Handoff` marker into the sandbox; asserts the skill detects the marker, routes to Lane A, and writes `<process>-tasks.md` that routes work to the right specialists (uipath-rpa + uipath-platform), follows the task-row schema, carries the anti-hallucination rule, includes a mandatory testing task, and does NOT re-author the SDD or start building. Caps max_thinking_tokens to keep the derivation turn short.

2. Single-project deferral, smoke. A self-contained Maestro Flow (decision + HTTP + inline HITL nodes) must route to ONE specialist (uipath-maestro-flow), not fan out through the multi-project planner. Guards the SKILL.md "skip for single-project" rule / most-common-mis-trigger boundary.

Both validated with `coder-eval plan`. New tasks carry the full required tag set (skill + tier + mode + lifecycle).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-23T11:31:46Z

Claude finished @RaduAna-Maria's task in 3m 42s

Coder-eval task lint (advisory)

2 task YAMLs changed; verdict counts: 0 Critical, 0 High, 0 Medium, 1 Low, 1 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

❌ High — PR body says "Both pass coder-eval plan (schema + agent-config resolution)" and "Behavioral runs pending via the Run Coder Eval workflow." Schema validation (coder-eval plan) is not a behavioral passing run. The body explicitly marks behavioral runs as pending, so no passing-run claim exists yet. Please edit the PR description once the runs complete to add a line like:

Ran skill-planner-lane-a-task-derivation and skill-planner-smoke-single-project-deferral — both passing.

Per-task lint

`tests/tasks/uipath-planner/lane_a_task_derivation/lane_a_task_derivation.yaml` — verdict: OK

No issues. Criteria check structural markers (Task T1, Identity:, Skill prompt, Blocked by:, uipath-rpa, uipath-platform, "do not infer or guess") that originate from the skill's internal templates and are not leaked by the prompt or the SDD fixture — strong evidence that the skill was actually exercised. The negative assertion (no .xaml authored) is a useful Lane-A guardrail. run_limits are correctly at top level. Solid first test for Lane A.

`tests/tasks/uipath-planner/single_project_deferral.yaml` — verdict: Low

Issues:

[Low] Could pass for the wrong reason: the prompt says "(for example uipath-maestro-flow)" (line 23) and the primary criterion (file_contains, line 36, weight 3.0) checks for exactly "uipath-maestro-flow". An agent could write that string to routing.txt without loading any skill — the ground-truth anchor carve-out partially applies (the prompt needs to supply the expected format), but the answer is given away as the example. A skill_triggered: uipath-maestro-flow criterion would close the gap by proving the correct skill was actually loaded.

Suggested fixes:

Consider adding a skill_triggered criterion (even at low weight, e.g. 1.0) to verify the agent actually loaded uipath-maestro-flow rather than just echoing the example. Alternatively, rephrase to remove the example: "Write the skill name to routing.txt" — the agent must then derive it.
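To make the "passes for the wrong reason" concern concrete, the sketch below contrasts a content-only check with one that also requires evidence that the skill was loaded. It is an illustration only: `loaded_skills` is a hypothetical stand-in for whatever run metadata a `skill_triggered` criterion inspects, and neither function is the coder-eval implementation.

```python
EXPECTED_SKILL = "uipath-maestro-flow"


def content_only_check(routing_text: str) -> bool:
    """Mirrors a file_contains-style check: only the file content is inspected."""
    return EXPECTED_SKILL in routing_text


def content_and_load_check(routing_text: str, loaded_skills: set[str]) -> bool:
    """Also requires that the expected skill appears among the skills loaded
    during the run (hypothetical metadata, assumed for illustration)."""
    return EXPECTED_SKILL in routing_text and EXPECTED_SKILL in loaded_skills


# An agent that simply echoes the example from the prompt, loading no skill.
echoed_answer = "uipath-maestro-flow\n"
print(content_only_check(echoed_answer))             # content check alone
print(content_and_load_check(echoed_answer, set()))  # content plus load evidence
```

The echoing agent satisfies the first check but not the second, which is the gap the suggested criterion is meant to close.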

Within-PR duplicates

No duplicate clusters detected. The two tasks exercise materially different workflows (Lane A task derivation vs. negative routing decision) at different tiers (integration vs. smoke).

Conclusion

⚠ 1 task has a Low-severity issue; 1 task is clean. Evidence of passing run is missing (High). Advisory only — not blocking merge. The tests themselves are well-designed and fill genuine coverage gaps (Lane A had zero prior tests; single-project deferral guards the most common mis-trigger).

github-actions · 2026-06-23T11:31:46Z

Claude finished @RaduAna-Maria's task in 2m 29s

PR Review: test(uipath-planner) — Lane A + Deferral Coverage

Read review criteria and project conventions
Read full diff and changed files
Check test task structure and tags
Validate fixture content and links
Check for anti-patterns and rule compliance
Post review

Summary

Adds two new coder_eval tasks for uipath-planner — an integration test for Lane A (PDD-driven task derivation from a staged SDD fixture) and a smoke test for single-project deferral (routing a one-project Maestro Flow request to uipath-maestro-flow instead of the planner). Both fill genuine coverage gaps: all prior planner tests exercised Phase D / SDD generation only.

Change-by-Change Review

1. `tests/tasks/uipath-planner/lane_a_task_derivation/lane_a_task_derivation.yaml`

Severity: OK

Well-structured integration task. Tags [uipath-planner, integration, mode:build, lifecycle:generate] carry all four required dimensions. task_id follows the skill-<domain>-<capability> pattern. The agent: block overrides only type (needed for sdk_options acceptance) and max_thinking_tokens — both justified in comments. The run_limits overrides (task_timeout: 1800, turn_timeout: 1500) exceed experiment defaults but are documented with rationale.

Success criteria grade behavior exclusively (file_exists, file_contains, run_command) — no self-report grading. Weight distribution is appropriate: 2.0 for artifact existence and routing correctness, 1.5 for schema compliance, 1.0 for supporting checks. The negation guard (! find . -name '*.xaml' ...) correctly verifies the agent stayed in Lane A without building.
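For a sense of how those weights shape the result, here is a small sketch of a weighted pass fraction. The aggregation formula is an assumption (this review does not state how coder-eval combines weights), and the four criteria below are an illustrative subset using the weight values mentioned above, not the task's full criteria list.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    weight: float
    passed: bool


def weighted_score(criteria: list[Criterion]) -> float:
    """Fraction of total weight earned by passing criteria (assumed aggregation)."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in criteria if c.passed)
    return earned / total


# Illustrative subset only; names are descriptive labels, not real criterion ids.
example = [
    Criterion("tasks file exists", 2.0, True),
    Criterion("routes to correct specialists", 2.0, True),
    Criterion("task-row schema followed", 1.5, False),
    Criterion("no .xaml authored", 1.0, True),
]
print(round(weighted_score(example), 3))
```

Under this assumed formula, missing the schema check costs more than missing a supporting check but less than missing the artifact or the routing, which matches the intent described above.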

Prompt is goal-oriented and minimal — no CLI flag enumeration.

2. `tests/tasks/uipath-planner/lane_a_task_derivation/fixtures/vendor-payment-sync-sdd.md`

Severity: OK

Realistic non-UI RPA SDD fixture. Contains the ## Planner Handoff marker with planner-handoff:v1 HTML comment, a populated handoff table (tasks file, scope, delivery model), and enough sections (process overview, application inventory, project structure, credentials, deployment) for the skill to derive tasks from. No secrets, personal paths, or cross-skill references. Clean and self-contained.

3. `tests/tasks/uipath-planner/single_project_deferral.yaml`

Severity: OK

Clean smoke task. Tags [uipath-planner, smoke, mode:build, lifecycle:discover] carry all four required dimensions. lifecycle:discover is the right choice — this is a routing decision, not artifact generation. No agent: or run_limits: overrides — inherits everything from the experiment, which is correct for a lightweight smoke task.

Success criteria are tight: primary validation (file_contains "uipath-maestro-flow", weight 3.0), negative check (! grep -qi planner routing.txt, weight 2.0), and artifact-absence guard (! find . \( -name '*-tasks.md' ..., weight 1.0). The prompt explicitly instructs the agent to write its answer to routing.txt with no explanation — keeps grading deterministic.
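The three checks can be read as the following Python equivalent. It is a sketch under stated assumptions: `grade_deferral` is a hypothetical helper, and because the `find` expression is truncated above, only the `*-tasks.md` pattern is included by default; any further patterns would need to be passed in explicitly.

```python
from pathlib import Path


def grade_deferral(sandbox: Path, artifact_globs: tuple[str, ...] = ("*-tasks.md",)) -> dict[str, bool]:
    """Evaluate the three deferral checks against a sandbox directory."""
    routing = sandbox / "routing.txt"
    text = routing.read_text(encoding="utf-8") if routing.is_file() else ""

    return {
        # Primary validation: the answer names the single specialist.
        "names_specialist": "uipath-maestro-flow" in text,
        # Like `! grep -qi planner routing.txt`: case-insensitive, and a missing
        # file also counts as "no mention" because grep fails and `!` inverts it.
        "no_planner_mention": "planner" not in text.lower(),
        # Artifact-absence guard: none of the planner output patterns exist.
        "no_planner_artifacts": not any(
            True for pattern in artifact_globs for _ in sandbox.rglob(pattern)
        ),
    }
```

Note that the second check passes vacuously when `routing.txt` is absent, so it is only meaningful alongside the primary check.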

What's Missing

Nothing identified. CODEOWNERS already covers both /skills/uipath-planner/ and /tests/tasks/uipath-planner/ (lines 68–69). No sandbox env_packages misuse. No cross-skill imports.

Area Ratings

| Area | Status | Notes |
| --- | --- | --- |
| Frontmatter | N/A | Test tasks, not skill definitions |
| E2E Tests | OK | Two new tasks with full required tag sets; behavioral runs noted as pending in the PR body |
| Skill Body | N/A | No skill changes |
| References & Assets | OK | Fixture is self-contained and well-structured |
| Repo Hygiene | OK | CODEOWNERS covered; no secrets; changes scoped to planner tests only |

Issues for Manual Review

Behavioral run results: The PR notes that coder-eval behavioral runs are pending via the Run Coder Eval workflow dispatch. Results should be verified before merge — coder-eval plan validates schema but not whether the agent actually produces the expected artifacts.
max_thinking_tokens: 10000 cap: This is an intentional lever from the PR #1636 investigation ("fix(test): coherent timeouts on all planner tasks + close SDD filename prompt-gaps"). Worth monitoring whether it is sufficient for the Lane A derivation; if the SDD is more complex than this fixture, the cap may need revisiting.

Conclusion

Clean PR — no issues found. Both tasks are well-structured, properly tagged, grade on behavior rather than self-reports, and fill real coverage gaps. Approve once behavioral run results confirm the tasks pass.

… criteria

Run 28023294472 showed the Lane A task scored 3/7 because the planner skill NEVER LOADED: tools used were {Read: 1, Write: 1}, no Skill invocation. The agent produced a good but ad-hoc task list from general reasoning, so it missed the documented plan-and-tasks-format schema (Task T<N>, Identity, Status, Skill prompt + anti-hallucination rule). The prompt only said "derive the task list" with no instruction to load the skill, so the agent took the shortcut.

Fixes:
- Prompt now explicitly loads uipath-planner and follows its PDD-driven (Lane A) workflow + documented tasks-file format. This is an integration test of Lane A BEHAVIOR; auto-activation on an SDD is a separate (activation) concern.
- Removed the max_thinking_tokens cap so the agent can read and apply the format guide faithfully.
- Softened two brittle criteria: testing task accepts Testing/test case/uipath-test; header check references the source SDD + autonomy instead of exact label strings.

Deferral smoke test passed (4/4) unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…l test prompt

Run 28023294472 showed the Lane A test's planner skill never loaded: tools used were only {Read, Write}, no Skill invocation. Forcing the load in the prompt would have masked the real gap, so this fixes the cause instead.

Root cause: the description triggers on the literal `sdd.md` filename, but the skill's own convention writes `<process-kebab>-sdd.md`, and it frames task derivation as a step that follows authoring rather than a first-class entry for an SDD the user already has. So PDD->SDD activated (pdd_to_sdd passes) but SDD->derive-tasks did not.

- SKILL.md description: trigger on PDD / SDD files (`pdd.md`, `*-sdd.md`) and make "derives the task list from an existing SDD" a first-class capability.
- Lane A test prompt: reverted the forced skill-load to a natural user request ("I've finished the SDD ... plan the build from it"), so the test exercises real auto-activation rather than a hand-fed skill.

Note: the SKILL.md description change trips the activation recall-eval gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
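The filename mismatch behind that root cause can be illustrated with glob matching. This is an analogy only: skill activation is presumably decided from the description text rather than by literal pattern matching, and the `triggers` helper is hypothetical. The patterns and the fixture file name are the ones mentioned in this PR.

```python
from fnmatch import fnmatch

OLD_TRIGGERS = ["sdd.md"]             # literal name the old description mentioned
NEW_TRIGGERS = ["pdd.md", "*-sdd.md"]  # patterns named in the updated description


def triggers(filename: str, patterns: list[str]) -> bool:
    """True if the file name matches any of the trigger patterns."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


fixture_name = "vendor-payment-sync-sdd.md"  # follows <process-kebab>-sdd.md
print(triggers(fixture_name, OLD_TRIGGERS))
print(triggers(fixture_name, NEW_TRIGGERS))
```

A file written under the skill's own naming convention never equals the bare `sdd.md`, which is consistent with the activation gap described in the commit message.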

RaduAna-Maria and others added 2 commits June 23, 2026 14:46

RaduAna-Maria assigned gabrielavaduva Jun 23, 2026

RaduAna-Maria marked this pull request as draft June 23, 2026 15:26

RaduAna-Maria wants to merge 3 commits into main from test/planner-lane-a-and-deferral-coverage