
test(uipath-planner): add Lane A task-derivation + single-project deferral coverage #1641

Draft
RaduAna-Maria wants to merge 3 commits into main from test/planner-lane-a-and-deferral-coverage

Conversation

@RaduAna-Maria

Contributor

What

Adds two coverage tests for uipath-planner. Every existing planner test exercises Phase D / SDD generation only — these fill two untested gaps surfaced while triaging the SDD timeout work (PR #1636).

1. Lane A — PDD-driven task derivation (integration)

tests/tasks/uipath-planner/lane_a_task_derivation/

Lane A (read an SDD with a ## Planner Handoff marker → derive the task list → route to specialists) had zero coverage. The test stages a finished, non-UI RPA SDD fixture (vendor-payment-sync-sdd.md, with a populated handoff header) into the sandbox and asserts the skill:

  • detects the marker and routes to Lane A,
  • writes vendor-payment-sync-tasks.md (the file named in the handoff header),
  • routes work to the correct specialists (uipath-rpa build + uipath-platform for the queue/asset/Orchestrator deploy),
  • follows the plan-and-tasks task-row schema (Task T1, Identity:, Status:, Skill prompt, Blocked by:),
  • includes a mandatory Testing task and the anti-hallucination rule,
  • does not re-author the SDD or start building (no .xaml authored).
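
The schema-related assertions above amount to substring checks on the derived tasks file. A minimal sketch of that idea, assuming a hypothetical helper `missing_markers` (not part of coder-eval) and using only the marker strings listed above:

```python
# Hypothetical helper, not part of coder-eval: reports which task-row
# schema markers are absent from a derived tasks file.
REQUIRED_MARKERS = ("Task T1", "Identity:", "Status:", "Skill prompt", "Blocked by:")


def missing_markers(tasks_text: str, markers=REQUIRED_MARKERS) -> list[str]:
    """Return the markers that do not appear anywhere in tasks_text."""
    return [m for m in markers if m not in tasks_text]


if __name__ == "__main__":
    # Illustrative sample only; the real row layout is defined by the
    # skill's plan-and-tasks format guide, which is not reproduced here.
    sample = (
        "Task T1\n"
        "Identity: example\n"
        "Status: pending\n"
        "Skill prompt: ...\n"
        "Blocked by: none\n"
    )
    print(missing_markers(sample))
```

An empty list means every marker was found; any returned entry names a schema element the agent omitted.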

It's cheap by design — task derivation from a provided SDD, no full SDD authoring — and caps max_thinking_tokens to keep the turn short (the thinking-bound-turn lever from the PR #1636 investigation).

2. Single-project deferral (smoke)

tests/tasks/uipath-planner/single_project_deferral.yaml

A self-contained Maestro Flow (decision + HTTP + inline HITL nodes) must route to one specialist (uipath-maestro-flow), not fan out through the multi-project planner. Inline nodes are author sub-steps, not separate buildable projects. Guards the SKILL.md "skip for single-project" rule — explicitly called out there as the most common mis-trigger. Graded as a one-line routing decision plus a check that no planner plan/tasks/SDD artifact was produced.

Validation

  • Both pass coder-eval plan (schema + agent-config resolution).
  • Both carry the full required tag set: skill + tier + mode:* + lifecycle:*.
  • Behavioral runs pending via the Run Coder Eval workflow (workflow_dispatch, task_globs: tasks/uipath-planner/lane_a_task_derivation/lane_a_task_derivation.yaml tasks/uipath-planner/single_project_deferral.yaml). The auto smoke check will run the deferral (smoke-tagged) task automatically.

Independent of #1636 (branched from main); no overlap with the timeout/filename fixes.

🤖 Generated with Claude Code

…erral coverage

Two gaps the existing planner suite missed — every prior planner test exercised
Phase D / SDD generation only:

1. Lane A (PDD-driven task derivation), integration. Stages a finished SDD with a
   `## Planner Handoff` marker into the sandbox; asserts the skill detects the
   marker, routes to Lane A, and writes `<process>-tasks.md` that routes work to
   the right specialists (uipath-rpa + uipath-platform), follows the task-row
   schema, carries the anti-hallucination rule, includes a mandatory testing
   task, and does NOT re-author the SDD or start building. Caps
   max_thinking_tokens to keep the derivation turn short.

2. Single-project deferral, smoke. A self-contained Maestro Flow (decision +
   HTTP + inline HITL nodes) must route to ONE specialist (uipath-maestro-flow),
   not fan out through the multi-project planner. Guards the SKILL.md "skip for
   single-project" rule / most-common-mis-trigger boundary.

Both validated with `coder-eval plan`. New tasks carry the full required tag set
(skill + tier + mode + lifecycle).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 23, 2026

Contributor

Claude finished @RaduAna-Maria's task in 3m 42s


Coder-eval task lint (advisory)

2 task YAMLs changed; verdict counts: 0 Critical, 0 High, 0 Medium, 1 Low, 1 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body says "Both pass coder-eval plan (schema + agent-config resolution)" and "Behavioral runs pending via the Run Coder Eval workflow." Schema validation (coder-eval plan) is not a behavioral passing run. The body explicitly marks behavioral runs as pending, so no passing-run claim exists yet. Please edit the PR description once the runs complete to add a line like:

Ran skill-planner-lane-a-task-derivation and skill-planner-smoke-single-project-deferral — both passing.

Per-task lint

tests/tasks/uipath-planner/lane_a_task_derivation/lane_a_task_derivation.yaml — verdict: OK

No issues. Criteria check structural markers (Task T1, Identity:, Skill prompt, Blocked by:, uipath-rpa, uipath-platform, "do not infer or guess") that originate from the skill's internal templates and are not leaked by the prompt or the SDD fixture — strong evidence that the skill was actually exercised. The negative assertion (no .xaml authored) is a useful Lane-A guardrail. run_limits are correctly at top level. Solid first test for Lane A.

tests/tasks/uipath-planner/single_project_deferral.yaml — verdict: Low

Issues:

  • [Low] Could pass for the wrong reason: the prompt says "(for example uipath-maestro-flow)" (line 23) and the primary criterion (file_contains, line 36, weight 3.0) checks for exactly "uipath-maestro-flow". An agent could write that string to routing.txt without loading any skill — the ground-truth anchor carve-out partially applies (the prompt needs to supply the expected format), but the answer is given away as the example. A skill_triggered: uipath-maestro-flow criterion would close the gap by proving the correct skill was actually loaded.

Suggested fixes:

  • Consider adding a skill_triggered criterion (even at low weight, e.g. 1.0) to verify the agent actually loaded uipath-maestro-flow rather than just echoing the example. Alternatively, rephrase to remove the example: "Write the skill name to routing.txt" — the agent must then derive it.
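
The loophole described above can be sketched in a few lines. This is a hypothetical model of the two criteria, not coder-eval's implementation; `loaded_skills` stands in for whatever record of skill loads a `skill_triggered` criterion would consult:

```python
# Hypothetical model of the criteria; function and parameter names are
# illustrative and do not come from coder-eval.
EXPECTED = "uipath-maestro-flow"


def file_contains_passes(routing_text: str) -> bool:
    """Mimics a file_contains check for the expected skill name."""
    return EXPECTED in routing_text


def combined_passes(routing_text: str, loaded_skills: set[str]) -> bool:
    """Additionally requires that the skill was actually loaded in the run."""
    return file_contains_passes(routing_text) and EXPECTED in loaded_skills


# An agent that merely echoes the example from the prompt, loading no skill:
echoed = "uipath-maestro-flow\n"
print(file_contains_passes(echoed))               # string check alone passes
print(combined_passes(echoed, set()))             # fails once a load is required
print(combined_passes(echoed, {EXPECTED}))        # passes when the skill loaded
```

The string check alone cannot distinguish an echoed example from a real routing decision, which is why the second condition is suggested.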

Within-PR duplicates

No duplicate clusters detected. The two tasks exercise materially different workflows (Lane A task derivation vs. negative routing decision) at different tiers (integration vs. smoke).

Conclusion

⚠ 1 task has a Low-severity issue; 1 task is clean. Evidence of passing run is missing (High). Advisory only — not blocking merge. The tests themselves are well-designed and fill genuine coverage gaps (Lane A had zero prior tests; single-project deferral guards the most common mis-trigger).


@github-actions

github-actions Bot commented Jun 23, 2026

Contributor

Claude finished @RaduAna-Maria's task in 2m 29s


PR Review: test(uipath-planner) — Lane A + Deferral Coverage

  • Read review criteria and project conventions
  • Read full diff and changed files
  • Check test task structure and tags
  • Validate fixture content and links
  • Check for anti-patterns and rule compliance
  • Post review

Summary

Adds two new coder_eval tasks for uipath-planner — an integration test for Lane A (PDD-driven task derivation from a staged SDD fixture) and a smoke test for single-project deferral (routing a one-project Maestro Flow request to uipath-maestro-flow instead of the planner). Both fill genuine coverage gaps: all prior planner tests exercised Phase D / SDD generation only.

Change-by-Change Review

1. tests/tasks/uipath-planner/lane_a_task_derivation/lane_a_task_derivation.yaml

Severity: OK

Well-structured integration task. Tags [uipath-planner, integration, mode:build, lifecycle:generate] carry all four required dimensions. task_id follows the skill-<domain>-<capability> pattern. The agent: block overrides only type (needed for sdk_options acceptance) and max_thinking_tokens — both justified in comments. The run_limits overrides (task_timeout: 1800, turn_timeout: 1500) exceed experiment defaults but are documented with rationale.

Success criteria grade behavior exclusively (file_exists, file_contains, run_command) — no self-report grading. Weight distribution is appropriate: 2.0 for artifact existence and routing correctness, 1.5 for schema compliance, 1.0 for supporting checks. The negation guard (! find . -name '*.xaml' ...) correctly verifies the agent stayed in Lane A without building.
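
For readers less familiar with the shell form, the visible part of the negation guard (no `*.xaml` file anywhere under the working directory) can be approximated as below. This is a sketch only: `no_xaml_authored` is a hypothetical name, and the remainder of the `find` expression is elided in the review, so nothing beyond the `*.xaml` pattern is modelled.

```python
import tempfile
from pathlib import Path


def no_xaml_authored(root: str = ".") -> bool:
    """True when no *.xaml file exists anywhere under root."""
    # rglob yields lazily, so this stops at the first match.
    return next(Path(root).rglob("*.xaml"), None) is None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as sandbox:
        print(no_xaml_authored(sandbox))  # empty sandbox: guard holds
        (Path(sandbox) / "Main.xaml").write_text("<Activity/>")
        print(no_xaml_authored(sandbox))  # a workflow was authored: guard fails
```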

Prompt is goal-oriented and minimal — no CLI flag enumeration.

2. tests/tasks/uipath-planner/lane_a_task_derivation/fixtures/vendor-payment-sync-sdd.md

Severity: OK

Realistic non-UI RPA SDD fixture. Contains the ## Planner Handoff marker with planner-handoff:v1 HTML comment, a populated handoff table (tasks file, scope, delivery model), and enough sections (process overview, application inventory, project structure, credentials, deployment) for the skill to derive tasks from. No secrets, personal paths, or cross-skill references. Clean and self-contained.

3. tests/tasks/uipath-planner/single_project_deferral.yaml

Severity: OK

Clean smoke task. Tags [uipath-planner, smoke, mode:build, lifecycle:discover] carry all four required dimensions. lifecycle:discover is the right choice — this is a routing decision, not artifact generation. No agent: or run_limits: overrides — inherits everything from the experiment, which is correct for a lightweight smoke task.

Success criteria are tight: primary validation (file_contains "uipath-maestro-flow", weight 3.0), negative check (! grep -qi planner routing.txt, weight 2.0), and artifact-absence guard (! find . \( -name '*-tasks.md' ..., weight 1.0). The prompt explicitly instructs the agent to write its answer to routing.txt with no explanation — keeps grading deterministic.
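
To make the weighting concrete, here is a rough sketch of how the three checks could combine into one score. The aggregation (earned weight divided by total weight) is an assumption, not coder-eval's documented behavior, and `artifact_found` abstracts the `find` guard whose full pattern is elided above.

```python
# Hypothetical aggregation; coder-eval's real scoring may differ.
def weighted_score(results: list[tuple[bool, float]]) -> float:
    """Fraction of total weight earned by the passing checks."""
    total = sum(weight for _, weight in results)
    earned = sum(weight for passed, weight in results if passed)
    return earned / total if total else 0.0


def grade_routing(routing_text: str, artifact_found: bool) -> float:
    checks = [
        ("uipath-maestro-flow" in routing_text, 3.0),   # primary validation
        ("planner" not in routing_text.lower(), 2.0),   # mirrors ! grep -qi planner
        (not artifact_found, 1.0),                      # artifact-absence guard
    ]
    return weighted_score(checks)


if __name__ == "__main__":
    print(grade_routing("uipath-maestro-flow\n", artifact_found=False))
    print(grade_routing("uipath-planner\n", artifact_found=False))
```

Under this assumed aggregation, a correct one-line answer earns full weight, while routing to the planner loses both the primary and the negative check.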

What's Missing

Nothing identified. CODEOWNERS already covers both /skills/uipath-planner/ and /tests/tasks/uipath-planner/ (lines 68–69). No sandbox env_packages misuse. No cross-skill imports.

Area Ratings

| Area | Status | Notes |
|---|---|---|
| Frontmatter | N/A | Test tasks, not skill definitions |
| E2E Tests | OK | Two new tasks with full required tag sets; behavioral runs noted as pending in the PR body |
| Skill Body | N/A | No skill changes |
| References & Assets | OK | Fixture is self-contained and well-structured |
| Repo Hygiene | OK | CODEOWNERS covered; no secrets; changes scoped to planner tests only |

Issues for Manual Review

  • Behavioral run results: The PR notes that coder-eval behavioral runs are pending via the Run Coder Eval workflow dispatch. Results should be verified before merge — coder-eval plan validates schema but not whether the agent actually produces the expected artifacts.
  • max_thinking_tokens: 10000 cap: This is an intentional lever from the PR #1636 investigation (fix(test): coherent timeouts on all planner tasks + close SDD filename prompt-gaps). Worth monitoring whether it is sufficient for the Lane A derivation: if the SDD is more complex than this fixture, the cap may need revisiting.

Conclusion

Clean PR — no issues found. Both tasks are well-structured, properly tagged, grade on behavior rather than self-reports, and fill real coverage gaps. Approve once behavioral run results confirm the tasks pass.


RaduAna-Maria and others added 2 commits June 23, 2026 14:46
… criteria

Run 28023294472 showed the Lane A task scored 3/7 because the planner skill
NEVER LOADED — tools used were {Read: 1, Write: 1}, no Skill invocation. The
agent produced a good but ad-hoc task list from general reasoning, so it missed
the documented plan-and-tasks-format schema (Task T<N>, Identity, Status, Skill
prompt + anti-hallucination rule). The prompt only said "derive the task list"
with no instruction to load the skill, so the agent took the shortcut.

Fixes:
- Prompt now explicitly loads uipath-planner and follows its PDD-driven (Lane A)
  workflow + documented tasks-file format. This is an integration test of Lane A
  BEHAVIOR; auto-activation on an SDD is a separate (activation) concern.
- Removed the max_thinking_tokens cap so the agent can read and apply the format
  guide faithfully.
- Softened two brittle criteria: testing task accepts Testing/test case/
  uipath-test; header check references the source SDD + autonomy instead of
  exact label strings.

Deferral smoke test passed (4/4) unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l test prompt

Run 28023294472 showed the Lane A test's planner skill never loaded — tools used
were only {Read, Write}, no Skill invocation. Forcing the load in the prompt
would have masked the real gap, so this fixes the cause instead.

Root cause: the description triggers on the literal `sdd.md` filename, but the
skill's own convention writes `<process-kebab>-sdd.md`, and it frames task
derivation as a step that follows authoring rather than a first-class entry for
an SDD the user already has. So PDD->SDD activated (pdd_to_sdd passes) but
SDD->derive-tasks did not.
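
The filename mismatch can be illustrated with glob matching. This is an analogy only: skill activation is driven by the model reading the description, not by a glob, and the patterns below are taken from this commit message rather than from any matcher in the codebase.

```python
from fnmatch import fnmatchcase

# Fixture name used by the Lane A test, following <process-kebab>-sdd.md.
name = "vendor-payment-sync-sdd.md"

print(fnmatchcase(name, "sdd.md"))    # literal filename from the old description
print(fnmatchcase(name, "*-sdd.md"))  # pattern matching the skill's own convention
```

The literal `sdd.md` does not cover a kebab-prefixed file, while `*-sdd.md` does, which mirrors the gap described above.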

- SKILL.md description: trigger on PDD / SDD files (`pdd.md`, `*-sdd.md`) and
  make "derives the task list from an existing SDD" a first-class capability.
- Lane A test prompt: reverted the forced skill-load to a natural user request
  ("I've finished the SDD ... plan the build from it"), so the test exercises
  real auto-activation rather than a hand-fed skill.

Note: the SKILL.md description change trips the activation recall-eval gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@RaduAna-Maria RaduAna-Maria marked this pull request as draft June 23, 2026 15:26


2 participants