fix(test): coherent timeouts on all planner tasks + close SDD filename prompt-gaps by RaduAna-Maria · Pull Request #1636 · UiPath/skills

RaduAna-Maria · 2026-06-23T08:26:01Z

Why

Two unrelated failures in the uipath-planner SDD tasks, plus a latent config issue across the whole planner suite.

1. `pdd_to_sdd` timed out at 1204s (TIMEOUT, score 0)

The task raised turn_timeout: 1800 but never set task_timeout, so it inherited the experiment default task_timeout: 1200. The whole-task wall clock fired before the SDD landed. gap_detection and product_selection had the same bug; every planner task that sets turn_timeout: 1800 was also leaving the task cap below the turn cap — an incoherent config (a turn can't outlive its task).

Fix: set task_timeout: 2700 on every planner task that raises turn_timeout, so the task cap is always ≥ the turn cap. e2e_rpa_sdd and constraint_gate already had both.

2. `e2e_rpa_sdd` failed all 6 criteria on a filename mismatch

The agent wrote SDD_EmployeeOnboardingDataEntry.md, but the criteria check the exact kebab-case path employee-onboarding-data-entry-sdd.md. The prompt never stated the required output filename. pdd_to_sdd had the same gap.

Fix: state the kebab-case output filename in both prompts. This is skill-compatible — planner Critical Rule 6 honors a user-specified output path.

3. Cleanup

Removed the stale max_iterations: 1 top-level field on e2e_rpa_sdd that coder-eval plan flagged as unknown/ignored (default is already 1 iteration).

Files changed (7)

pdd_to_sdd.yaml — add task_timeout: 2700; state output filename
e2e_rpa_sdd.yaml — state output filename; drop stale max_iterations
gap_detection.yaml, product_selection.yaml — add task_timeout: 2700
smoke_file_reading.yaml, smoke_pdd_activation.yaml, smoke_skill_activation.yaml — add task_timeout: 2700 (coherence with the 1800s turn cap)

Validation

coder-eval plan on all 9 planner tasks → All tasks are valid (this also runs as the advisory validate-task-schema check here).
Behavioral run pending via the Run Coder Eval workflow (workflow_dispatch, task_globs: tasks/uipath-planner/pdd_to_sdd.yaml tasks/uipath-planner/e2e_rpa_sdd/e2e_rpa_sdd.yaml) — the auto smoke check only runs --tags smoke, so it does not cover these two SDD tasks.

🤖 Generated with Claude Code

…e prompt-gaps The planner SDD-authoring tasks set turn_timeout: 1800 but inherited the experiment default task_timeout: 1200, so the whole-task wall clock fired at ~1200s before the SDD landed (pdd_to_sdd timed out at 1204s). The per-turn cap also sat above the task cap, which is incoherent. Set task_timeout: 2700 on every planner task that raises turn_timeout so the task cap is always >= the turn cap. Neither pdd_to_sdd nor e2e_rpa_sdd stated the required kebab-case output filename, yet their criteria check exact paths -- the agent named the file SDD_EmployeeOnboardingDataEntry.md and all 6 criteria failed. State the filename in the prompt (planner Critical Rule 6 honors a user-specified output path). Also remove the stale, ignored max_iterations field on e2e_rpa_sdd that coder-eval plan flagged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-23T08:26:21Z

Claude finished @RaduAna-Maria's task in 2m 18s —— View job

PR Review: fix(test): coherent timeouts on all planner tasks + close SDD filename prompt-gaps

Read review criteria and project conventions
Analyze full diff (7 files, +28/−8)
Read all 7 changed files + 2 untouched planner tasks + experiment configs
Check for consistency and correctness
Post review

Summary

Fixes an incoherent timeout config (turn_timeout: 1800 > experiment-default task_timeout: 1200) across all planner tasks, adds explicit output filenames to two SDD prompts that failed on filename mismatch, and removes a stale max_iterations field.

Change-by-Change Review

1. `tests/tasks/uipath-planner/pdd_to_sdd.yaml` — `task_timeout` + filename

OK. task_timeout: 2700 fixes the turn > task incoherence (confirmed: default.yaml line 6 sets task_timeout: 1200). The prompt now names invoice-data-entry-sdd.md, which matches the success_criteria path: on line 85. The 5-line comment explains a non-obvious constraint — justified.

2. `tests/tasks/uipath-planner/e2e_rpa_sdd/e2e_rpa_sdd.yaml` — filename + `max_iterations` removal

OK. Prompt now names employee-onboarding-data-entry-sdd.md, matching the 6 success criteria that reference it. max_iterations: 1 correctly removed (default is 1). The existing task_timeout: 2700 / turn_timeout: 1800 pair was already correct here.

3. `tests/tasks/uipath-planner/gap_detection.yaml` — `task_timeout`

OK. Same timeout fix as the others. Coherent with constraint_gate.yaml which already had both set.

4. `tests/tasks/uipath-planner/product_selection.yaml` — `task_timeout`

OK. Same pattern. Note: run_limits is at the bottom of the file (line 116) rather than near the top — unusual ordering but not a correctness issue.

5–7. `smoke_file_reading.yaml`, `smoke_pdd_activation.yaml`, `smoke_skill_activation.yaml` — `task_timeout`

OK. All three get the 1-line comment variant + task_timeout: 2700. Clean and consistent.

What's Missing

Medium — Filename prompt-gap still open in 3 other SDD tasks. The PR fixes the filename gap in pdd_to_sdd and e2e_rpa_sdd, but three other tasks have the same pattern — success criteria check a specific kebab-case filename while the prompt never states it:
- gap_detection.yaml — criteria check payment-reconciliation-sdd.md, prompt doesn't name it
- product_selection.yaml — criteria check customer-support-agent-sdd.md, prompt doesn't name it
- constraint_gate/constraint_gate.yaml — criteria check purchase-request-approval-sdd.md, prompt doesn't name it
These haven't failed yet (the agent likely derives the name from the PDD's process name + the skill's kebab-case convention), but the same latent mismatch that burned e2e_rpa_sdd exists here. The fix is the same one-liner added to the other two prompts: state the output filename.

Fix this →
Low — bpmn_vs_case_management.yaml also has turn_timeout: 1800 without task_timeout. The task is skip: true so it can't fail in CI, but it has the same incoherent config. Not blocking.

Area Ratings

Area	Status	Notes
Frontmatter	OK	No frontmatter changes; all task_ids and tags unchanged
E2E Tests	OK	These are the test fixes; no new skill code changed
Skill Body	N/A	No skill files modified
References & Assets	N/A	No reference files modified
Repo Hygiene	OK	Changes scoped to `tests/tasks/uipath-planner/` only; no secrets or stray files

Issues for Manual Review

The PR description notes a behavioral run is pending via the Run Coder Eval workflow for pdd_to_sdd and e2e_rpa_sdd. Confirm those pass before merging — the timeout and filename fixes are sound on paper but the SDD authoring turn is long enough that other failure modes (model behavior, skill changes) could surface.

Conclusion

The core fixes are correct and well-scoped. The timeout incoherence is real (task_timeout: 1200 < turn_timeout: 1800), the filename gap caused a real failure, and the max_iterations removal is clean. The one gap is that the filename prompt-fix was applied to 2 of 5 affected SDD tasks — the other 3 have the same latent risk. I'd suggest adding the filename to those prompts in this PR while the pattern is fresh (one-liner each), but it's not blocking if those tasks haven't failed yet.

Verdict: Approve with the filename-gap suggestion as a non-blocking follow-up.

github-actions · 2026-06-23T08:26:21Z

Claude finished @RaduAna-Maria's task in 3m 47s —— View job

Coder-eval task lint (advisory)

7 task YAMLs changed; verdicts: 0 Critical, 0 High, 0 Medium, 1 Low, 6 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

❌ High — PR body does not claim the changed tasks have been run and passed. The body says coder-eval plan validates schema ("All tasks are valid") and a behavioral run is "pending via the Run Coder Eval workflow", but no passing-run result is claimed yet. Please edit the PR description to add a line like:

Ran skill-sdd-integration-pdd-to-sdd and skill-sdd-e2e-rpa-process locally and they passed.

Per-task lint

`tests/tasks/uipath-planner/e2e_rpa_sdd/e2e_rpa_sdd.yaml` — verdict: OK

`tests/tasks/uipath-planner/gap_detection.yaml` — verdict: OK (theme-captured; see Theme 1)

`tests/tasks/uipath-planner/pdd_to_sdd.yaml` — verdict: OK

`tests/tasks/uipath-planner/product_selection.yaml` — verdict: OK (theme-captured; see Theme 1)

`tests/tasks/uipath-planner/smoke_file_reading.yaml` — verdict: Low

Issues:

[Low] Could pass for the wrong reason (lines 23–24, 52–56): sentinel value QXR-9118-PURPLE is present in initial_prompt itself. The file_contains criterion at line 52 asserts this value in sentinel.txt as proof the Read tool was used, but an agent could write it directly from the prompt without reading pdd.md.

Suggested fixes:

Consider adding a command_executed criterion matching a Read-tool invocation on pdd.md (if the eval harness tracks tool calls), or use a two-file sentinel where the value written to pdd.md is computed from a transform the agent must apply after reading. Not blocking — this is a known smoke-test design trade-off.

`tests/tasks/uipath-planner/smoke_pdd_activation.yaml` — verdict: OK

`tests/tasks/uipath-planner/smoke_skill_activation.yaml` — verdict: OK

Within-PR duplicates

No duplicate clusters detected. All 7 files are modifications to existing tasks, not new additions.

Themes

Theme 1 (Low): gap_detection.yaml and product_selection.yaml have the same filename-not-in-prompt gap that this PR fixed in e2e_rpa_sdd and pdd_to_sdd. Neither prompt specifies the expected output filename (payment-reconciliation-sdd.md and customer-support-agent-sdd.md respectively), yet success_criteria check those exact paths. This is the same mismatch the PR describes as causing the e2e_rpa_sdd six-criteria failure. If these omissions are deliberate (testing the skill's kebab-case naming convention), consider documenting that intent in the task description; if not, add Save the SDD as \` in the current directory.` to each prompt, matching what was done for the two fixed tasks.

Conclusion

⚠ 1 task has a Low finding (smoke_file_reading sentinel design trade-off). 2 additional tasks share a filename-in-prompt gap addressed elsewhere in this PR (Theme 1, Low). Evidence of passing run is not yet confirmed — please update the PR description once behavioral runs complete. Advisory only — not blocking merge.

Run 28012844870 proved the filename + skill-routing fixes work: the agent loaded uipath-planner, wrote employee-onboarding-data-entry-sdd.md, and produced a complete 730-line SDD (all 18 sections + Planner Handoff). But the task still scored 0 -- the comprehensive Master-Project SDD authoring runs ~100k output tokens in one autonomous turn and hit turn_timeout: 1800s mid-wrap-up. agent_timeout is a hard ERROR, so the success criteria never ran against the finished file. The sibling pdd_to_sdd (single-product Invoice SDD) succeeded in 1136s, so 1800s is only tight for this one Master-Project task. Raise turn_timeout to 2700 (file was already complete at the cap; the turn just needs to return) and task_timeout to 3600 so criteria checking + container overhead fit above the turn budget. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…vation assertion The e2e re-run kept failing — but not on timeout. The agent free-formed a 1167-line SDD in 3 turns with tools={Write:1}, never loading the planner skill, so check_sdd.py matched none of the template's 18 sections (run 28012844870). When the skill DOES load it produces a correct SDD; activation was simply non-deterministic for an inline PDD. Fix (no force-load; fix activation for real): - Stage the PDD as `pdd.md` via sandbox.template_sources. A real PDD file is the strongest activation signal ("Always invoke for pdd.md files"), far more reliable than an inline PDD. - Natural prompt that references pdd.md (no "load the skill" hand-holding). - Add a command_executed assertion (tool_name: Skill, pattern: uipath-planner) so an activation miss fails crisply as an activation failure rather than as downstream check_sdd.py noise. Keeps the turn_timeout:2700 / task_timeout:3600 budget for when the skill loads and runs the full Master-Project authoring turn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

RaduAna-Maria mentioned this pull request Jun 23, 2026

test(uipath-planner): add Lane A task-derivation + single-project deferral coverage #1641

Draft

RaduAna-Maria assigned gabrielavaduva Jun 23, 2026

RaduAna-Maria added 3 commits June 23, 2026 18:27

test(uipath-planner): hardcode planner activation in e2e prompt

89afd70

test(uipath-planner): restore inline pdd and trim smoke timeout comments

8e675da

test(uipath-planner): use fixture pdd.md as sole e2e prompt source

dfcb7ba

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(test): coherent timeouts on all planner tasks + close SDD filename prompt-gaps#1636

fix(test): coherent timeouts on all planner tasks + close SDD filename prompt-gaps#1636
RaduAna-Maria wants to merge 6 commits into
mainfrom
fix/planner-sdd-test-repairs

RaduAna-Maria commented Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

RaduAna-Maria commented Jun 23, 2026

Why

1. pdd_to_sdd timed out at 1204s (TIMEOUT, score 0)

2. e2e_rpa_sdd failed all 6 criteria on a filename mismatch

3. Cleanup

Files changed (7)

Validation

Uh oh!

github-actions Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix(test): coherent timeouts on all planner tasks + close SDD filename prompt-gaps

Summary

Change-by-Change Review

1. tests/tasks/uipath-planner/pdd_to_sdd.yaml — task_timeout + filename

2. tests/tasks/uipath-planner/e2e_rpa_sdd/e2e_rpa_sdd.yaml — filename + max_iterations removal

3. tests/tasks/uipath-planner/gap_detection.yaml — task_timeout

4. tests/tasks/uipath-planner/product_selection.yaml — task_timeout

5–7. smoke_file_reading.yaml, smoke_pdd_activation.yaml, smoke_skill_activation.yaml — task_timeout

What's Missing

Area Ratings

Issues for Manual Review

Conclusion

Uh oh!

github-actions Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Coder-eval task lint (advisory)

Evidence of passing run

Per-task lint

tests/tasks/uipath-planner/e2e_rpa_sdd/e2e_rpa_sdd.yaml — verdict: OK

tests/tasks/uipath-planner/gap_detection.yaml — verdict: OK (theme-captured; see Theme 1)

tests/tasks/uipath-planner/pdd_to_sdd.yaml — verdict: OK

tests/tasks/uipath-planner/product_selection.yaml — verdict: OK (theme-captured; see Theme 1)

tests/tasks/uipath-planner/smoke_file_reading.yaml — verdict: Low

tests/tasks/uipath-planner/smoke_pdd_activation.yaml — verdict: OK

tests/tasks/uipath-planner/smoke_skill_activation.yaml — verdict: OK

Within-PR duplicates

Themes

Conclusion

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. `pdd_to_sdd` timed out at 1204s (TIMEOUT, score 0)

2. `e2e_rpa_sdd` failed all 6 criteria on a filename mismatch

github-actions Bot commented Jun 23, 2026 •

edited

Loading

1. `tests/tasks/uipath-planner/pdd_to_sdd.yaml` — `task_timeout` + filename

2. `tests/tasks/uipath-planner/e2e_rpa_sdd/e2e_rpa_sdd.yaml` — filename + `max_iterations` removal

3. `tests/tasks/uipath-planner/gap_detection.yaml` — `task_timeout`

4. `tests/tasks/uipath-planner/product_selection.yaml` — `task_timeout`

5–7. `smoke_file_reading.yaml`, `smoke_pdd_activation.yaml`, `smoke_skill_activation.yaml` — `task_timeout`

github-actions Bot commented Jun 23, 2026 •

edited

Loading

`tests/tasks/uipath-planner/e2e_rpa_sdd/e2e_rpa_sdd.yaml` — verdict: OK

`tests/tasks/uipath-planner/gap_detection.yaml` — verdict: OK (theme-captured; see Theme 1)

`tests/tasks/uipath-planner/pdd_to_sdd.yaml` — verdict: OK

`tests/tasks/uipath-planner/product_selection.yaml` — verdict: OK (theme-captured; see Theme 1)

`tests/tasks/uipath-planner/smoke_file_reading.yaml` — verdict: Low

`tests/tasks/uipath-planner/smoke_pdd_activation.yaml` — verdict: OK

`tests/tasks/uipath-planner/smoke_skill_activation.yaml` — verdict: OK