
fix(test): coherent timeouts on all planner tasks + close SDD filename prompt-gaps #1636

Open
RaduAna-Maria wants to merge 6 commits into main from fix/planner-sdd-test-repairs


@RaduAna-Maria


Why

Two unrelated failures in the uipath-planner SDD tasks, plus a latent config issue across the whole planner suite.

1. pdd_to_sdd timed out at 1204s (TIMEOUT, score 0)

The task raised turn_timeout: 1800 but never set task_timeout, so it inherited the experiment default task_timeout: 1200. The whole-task wall clock fired before the SDD landed. gap_detection and product_selection had the same bug; every planner task that sets turn_timeout: 1800 was also leaving the task cap below the turn cap — an incoherent config (a turn can't outlive its task).

Fix: set task_timeout: 2700 on every planner task that raises turn_timeout, so the task cap is always ≥ the turn cap. e2e_rpa_sdd and constraint_gate already had both.
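
The coherence rule above (task cap ≥ turn cap, with the experiment default filling in when a task omits task_timeout) can be sketched as a small check. This is an illustrative sketch only: the dict layout and the helper names are assumptions, not the coder-eval schema. Only the values 1200, 1800 and 2700 come from this PR.

```python
# Illustrative sketch: the dict layout is an assumption, not the coder-eval schema.
DEFAULT_TASK_TIMEOUT = 1200  # experiment default cited in this PR


def effective_task_timeout(task: dict) -> int:
    """Return the task cap, falling back to the experiment default when unset."""
    return task.get("task_timeout", DEFAULT_TASK_TIMEOUT)


def is_coherent(task: dict) -> bool:
    """A turn cannot outlive its task, so the task cap must be >= the turn cap."""
    turn_timeout = task.get("turn_timeout")
    if turn_timeout is None:
        return True  # assumption: an unset turn cap needs no comparison here
    return effective_task_timeout(task) >= turn_timeout


before = {"turn_timeout": 1800}                       # pdd_to_sdd before this PR
after = {"turn_timeout": 1800, "task_timeout": 2700}  # pdd_to_sdd after this PR

print(is_coherent(before))  # False: inherited 1200 < 1800
print(is_coherent(after))   # True: 2700 >= 1800
```

A check along these lines would also flag any future task that raises turn_timeout without touching task_timeout.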

2. e2e_rpa_sdd failed all 6 criteria on a filename mismatch

The agent wrote SDD_EmployeeOnboardingDataEntry.md, but the criteria check the exact kebab-case path employee-onboarding-data-entry-sdd.md. The prompt never stated the required output filename. pdd_to_sdd had the same gap.

Fix: state the kebab-case output filename in both prompts. This is skill-compatible — planner Critical Rule 6 honors a user-specified output path.
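
To see why all six criteria failed at once, here is a minimal sketch of an exact-path check. The `to_kebab_sdd_name` helper and the process name "Employee Onboarding Data Entry" are assumptions for illustration (the PR only states the two filenames). The real criteria live in the task YAML.

```python
import re


def to_kebab_sdd_name(process_name: str) -> str:
    """Hypothetical helper: turn a process name into a kebab-case SDD filename."""
    words = re.findall(r"[A-Za-z0-9]+", process_name)
    return "-".join(w.lower() for w in words) + "-sdd.md"


def path_criterion_passes(expected_path: str, written_files: set[str]) -> bool:
    """Exact-match check: any other spelling of the name counts as missing."""
    return expected_path in written_files


expected = to_kebab_sdd_name("Employee Onboarding Data Entry")
print(expected)  # employee-onboarding-data-entry-sdd.md
print(path_criterion_passes(expected, {"SDD_EmployeeOnboardingDataEntry.md"}))     # False
print(path_criterion_passes(expected, {"employee-onboarding-data-entry-sdd.md"}))  # True
```

Because every criterion keys off the same path, one naming miss fails them all, which is why stating the filename in the prompt is the cheapest fix.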

3. Cleanup

Removed the stale max_iterations: 1 top-level field on e2e_rpa_sdd that coder-eval plan flagged as unknown/ignored (default is already 1 iteration).

Files changed (7)

  • pdd_to_sdd.yaml — add task_timeout: 2700; state output filename
  • e2e_rpa_sdd.yaml — state output filename; drop stale max_iterations
  • gap_detection.yaml, product_selection.yaml — add task_timeout: 2700
  • smoke_file_reading.yaml, smoke_pdd_activation.yaml, smoke_skill_activation.yaml — add task_timeout: 2700 (coherence with the 1800s turn cap)

Validation

  • coder-eval plan on all 9 planner tasks → All tasks are valid (this also runs as the advisory validate-task-schema check here).
  • Behavioral run pending via the Run Coder Eval workflow (workflow_dispatch, task_globs: tasks/uipath-planner/pdd_to_sdd.yaml tasks/uipath-planner/e2e_rpa_sdd/e2e_rpa_sdd.yaml) — the auto smoke check only runs --tags smoke, so it does not cover these two SDD tasks.

🤖 Generated with Claude Code

…e prompt-gaps

The planner SDD-authoring tasks set turn_timeout: 1800 but inherited the
experiment default task_timeout: 1200, so the whole-task wall clock fired
at ~1200s before the SDD landed (pdd_to_sdd timed out at 1204s). The
per-turn cap also sat above the task cap, which is incoherent. Set
task_timeout: 2700 on every planner task that raises turn_timeout so the
task cap is always >= the turn cap.

Neither pdd_to_sdd nor e2e_rpa_sdd stated the required kebab-case output
filename, yet their criteria check exact paths -- the agent named the file
SDD_EmployeeOnboardingDataEntry.md and all 6 criteria failed. State the
filename in the prompt (planner Critical Rule 6 honors a user-specified
output path).

Also remove the stale, ignored max_iterations field on e2e_rpa_sdd that
coder-eval plan flagged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 23, 2026


Claude finished @RaduAna-Maria's task in 2m 18s


PR Review: fix(test): coherent timeouts on all planner tasks + close SDD filename prompt-gaps

  • Read review criteria and project conventions
  • Analyze full diff (7 files, +28/−8)
  • Read all 7 changed files + 2 untouched planner tasks + experiment configs
  • Check for consistency and correctness
  • Post review

Summary

Fixes an incoherent timeout config (turn_timeout: 1800 > experiment-default task_timeout: 1200) across all planner tasks, adds explicit output filenames to two SDD prompts that failed on filename mismatch, and removes a stale max_iterations field.

Change-by-Change Review

1. tests/tasks/uipath-planner/pdd_to_sdd.yaml — task_timeout + filename

OK. task_timeout: 2700 fixes the turn > task incoherence (confirmed: default.yaml line 6 sets task_timeout: 1200). The prompt now names invoice-data-entry-sdd.md, which matches the success_criteria path: on line 85. The 5-line comment explains a non-obvious constraint — justified.

2. tests/tasks/uipath-planner/e2e_rpa_sdd/e2e_rpa_sdd.yaml — filename + max_iterations removal

OK. Prompt now names employee-onboarding-data-entry-sdd.md, matching the 6 success criteria that reference it. max_iterations: 1 correctly removed (default is 1). The existing task_timeout: 2700 / turn_timeout: 1800 pair was already correct here.

3. tests/tasks/uipath-planner/gap_detection.yaml — task_timeout

OK. Same timeout fix as the others. Coherent with constraint_gate.yaml which already had both set.

4. tests/tasks/uipath-planner/product_selection.yaml — task_timeout

OK. Same pattern. Note: run_limits is at the bottom of the file (line 116) rather than near the top — unusual ordering but not a correctness issue.

5–7. smoke_file_reading.yaml, smoke_pdd_activation.yaml, smoke_skill_activation.yaml — task_timeout

OK. All three get the 1-line comment variant + task_timeout: 2700. Clean and consistent.

What's Missing

  • Medium — Filename prompt-gap still open in 3 other SDD tasks. The PR fixes the filename gap in pdd_to_sdd and e2e_rpa_sdd, but three other tasks have the same pattern — success criteria check a specific kebab-case filename while the prompt never states it:

    • gap_detection.yaml — criteria check payment-reconciliation-sdd.md, prompt doesn't name it
    • product_selection.yaml — criteria check customer-support-agent-sdd.md, prompt doesn't name it
    • constraint_gate/constraint_gate.yaml — criteria check purchase-request-approval-sdd.md, prompt doesn't name it

    These haven't failed yet (the agent likely derives the name from the PDD's process name + the skill's kebab-case convention), but the same latent mismatch that burned e2e_rpa_sdd exists here. The fix is the same one-liner added to the other two prompts: state the output filename.


  • Low — bpmn_vs_case_management.yaml also has turn_timeout: 1800 without task_timeout. The task is skip: true so it can't fail in CI, but it has the same incoherent config. Not blocking.
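
The filename prompt-gap flagged above lends itself to a mechanical check: for each task, confirm that every path the criteria look for is actually named in the prompt. A minimal sketch follows. The prompt strings and the `unnamed_outputs` helper are hypothetical; only the filenames come from this review.

```python
from pathlib import PurePosixPath


def unnamed_outputs(prompt: str, criteria_paths: list[str]) -> list[str]:
    """Return criteria paths whose filename never appears in the prompt."""
    return [p for p in criteria_paths if PurePosixPath(p).name not in prompt]


# Hypothetical prompts; only the filenames are taken from the review.
tasks = {
    "gap_detection": (
        "Write an SDD for the attached PDD.",
        ["payment-reconciliation-sdd.md"],
    ),
    "pdd_to_sdd": (
        "Write an SDD for the attached PDD. Save it as invoice-data-entry-sdd.md.",
        ["invoice-data-entry-sdd.md"],
    ),
}

for name, (prompt, paths) in tasks.items():
    print(name, unnamed_outputs(prompt, paths))
```

An empty list means the prompt names every file the criteria check. A non-empty list is the latent mismatch described above.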

Area Ratings

| Area | Status | Notes |
| --- | --- | --- |
| Frontmatter | OK | No frontmatter changes; all task_ids and tags unchanged |
| E2E Tests | OK | These are the test fixes; no new skill code changed |
| Skill Body | N/A | No skill files modified |
| References & Assets | N/A | No reference files modified |
| Repo Hygiene | OK | Changes scoped to tests/tasks/uipath-planner/ only; no secrets or stray files |

Issues for Manual Review

  • The PR description notes a behavioral run is pending via the Run Coder Eval workflow for pdd_to_sdd and e2e_rpa_sdd. Confirm those pass before merging — the timeout and filename fixes are sound on paper but the SDD authoring turn is long enough that other failure modes (model behavior, skill changes) could surface.

Conclusion

The core fixes are correct and well-scoped. The timeout incoherence is real (task_timeout: 1200 < turn_timeout: 1800), the filename gap caused a real failure, and the max_iterations removal is clean. The one gap is that the filename prompt-fix was applied to 2 of 5 affected SDD tasks — the other 3 have the same latent risk. I'd suggest adding the filename to those prompts in this PR while the pattern is fresh (one-liner each), but it's not blocking if those tasks haven't failed yet.

Verdict: Approve with the filename-gap suggestion as a non-blocking follow-up.


@github-actions

github-actions Bot commented Jun 23, 2026


Claude finished @RaduAna-Maria's task in 3m 47s


Coder-eval task lint (advisory)

7 task YAMLs changed; verdicts: 0 Critical, 0 High, 0 Medium, 1 Low, 6 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

High — PR body does not claim the changed tasks have been run and passed. The body says coder-eval plan validates schema ("All tasks are valid") and a behavioral run is "pending via the Run Coder Eval workflow", but no passing-run result is claimed yet. Please edit the PR description to add a line like:

Ran skill-sdd-integration-pdd-to-sdd and skill-sdd-e2e-rpa-process locally and they passed.

Per-task lint

tests/tasks/uipath-planner/e2e_rpa_sdd/e2e_rpa_sdd.yaml — verdict: OK

tests/tasks/uipath-planner/gap_detection.yaml — verdict: OK (theme-captured; see Theme 1)

tests/tasks/uipath-planner/pdd_to_sdd.yaml — verdict: OK

tests/tasks/uipath-planner/product_selection.yaml — verdict: OK (theme-captured; see Theme 1)

tests/tasks/uipath-planner/smoke_file_reading.yaml — verdict: Low

Issues:

  • [Low] Could pass for the wrong reason (lines 23–24, 52–56): sentinel value QXR-9118-PURPLE is present in initial_prompt itself. The file_contains criterion at line 52 asserts this value in sentinel.txt as proof the Read tool was used, but an agent could write it directly from the prompt without reading pdd.md.

Suggested fixes:

  • Consider adding a command_executed criterion matching a Read-tool invocation on pdd.md (if the eval harness tracks tool calls), or use a two-file sentinel where the value written to pdd.md is computed from a transform the agent must apply after reading. Not blocking — this is a known smoke-test design trade-off.
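
The "wrong reason" risk can be checked mechanically: if the sentinel already appears in the prompt, a file_contains match proves nothing about reading. Below is a minimal sketch, assuming a hypothetical derived-sentinel scheme (reversing the value stored in pdd.md) as one variant of the transform idea. The prompt strings are invented for illustration.

```python
def sentinel_leaks(prompt: str, sentinel: str) -> bool:
    """True when the agent could copy the sentinel straight from the prompt."""
    return sentinel in prompt


def derived_sentinel(value_in_file: str) -> str:
    """Hypothetical transform the agent must apply after reading the file."""
    return value_in_file[::-1]


stored = "QXR-9118-PURPLE"  # value from the lint finding; assumed to live in pdd.md
leaky_prompt = f"Read pdd.md and write {stored} to sentinel.txt."
safe_prompt = "Read pdd.md and write its code, reversed, to sentinel.txt."

print(sentinel_leaks(leaky_prompt, stored))                   # True
print(sentinel_leaks(safe_prompt, derived_sentinel(stored)))  # False
```

With a derived value, the expected file content cannot be produced from the prompt alone, so a passing criterion implies the file was read.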

tests/tasks/uipath-planner/smoke_pdd_activation.yaml — verdict: OK

tests/tasks/uipath-planner/smoke_skill_activation.yaml — verdict: OK

Within-PR duplicates

No duplicate clusters detected. All 7 files are modifications to existing tasks, not new additions.

Themes

  • Theme 1 (Low): gap_detection.yaml and product_selection.yaml have the same filename-not-in-prompt gap that this PR fixed in e2e_rpa_sdd and pdd_to_sdd. Neither prompt specifies the expected output filename (payment-reconciliation-sdd.md and customer-support-agent-sdd.md respectively), yet success_criteria check those exact paths. This is the same mismatch the PR describes as causing the e2e_rpa_sdd six-criteria failure. If these omissions are deliberate (testing the skill's kebab-case naming convention), consider documenting that intent in the task description; if not, add `Save the SDD as <filename> in the current directory.` to each prompt, matching what was done for the two fixed tasks.

Conclusion

⚠ 1 task has a Low finding (smoke_file_reading sentinel design trade-off). 2 additional tasks share a filename-in-prompt gap addressed elsewhere in this PR (Theme 1, Low). Evidence of passing run is not yet confirmed — please update the PR description once behavioral runs complete. Advisory only — not blocking merge.


Run 28012844870 proved the filename + skill-routing fixes work: the agent
loaded uipath-planner, wrote employee-onboarding-data-entry-sdd.md, and
produced a complete 730-line SDD (all 18 sections + Planner Handoff). But
the task still scored 0 -- the comprehensive Master-Project SDD authoring
runs ~100k output tokens in one autonomous turn and hit turn_timeout: 1800s
mid-wrap-up. agent_timeout is a hard ERROR, so the success criteria never
ran against the finished file.

The sibling pdd_to_sdd (single-product Invoice SDD) succeeded in 1136s, so
1800s is only tight for this one Master-Project task. Raise turn_timeout to
2700 (file was already complete at the cap; the turn just needs to return)
and task_timeout to 3600 so criteria checking + container overhead fit above
the turn budget.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…vation assertion

The e2e re-run kept failing — but not on timeout. The agent free-formed a
1167-line SDD in 3 turns with tools={Write:1}, never loading the planner skill,
so check_sdd.py matched none of the template's 18 sections (run 28012844870).
When the skill DOES load it produces a correct SDD; activation was simply
non-deterministic for an inline PDD.

Fix (no force-load; fix activation for real):
- Stage the PDD as `pdd.md` via sandbox.template_sources. A real PDD file is the
  strongest activation signal ("Always invoke for pdd.md files"), far more
  reliable than an inline PDD.
- Natural prompt that references pdd.md (no "load the skill" hand-holding).
- Add a command_executed assertion (tool_name: Skill, pattern: uipath-planner)
  so an activation miss fails crisply as an activation failure rather than as
  downstream check_sdd.py noise.

Keeps the turn_timeout:2700 / task_timeout:3600 budget for when the skill loads
and runs the full Master-Project authoring turn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
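
The intent of the activation assertion in the commit above (fail crisply on an activation miss instead of on downstream check_sdd.py noise) can be sketched as follows. The call-log shape, the `input` field, and the result labels are assumptions for illustration. Only the tool name Skill and the pattern uipath-planner come from the commit message.

```python
import re


def skill_activated(tool_calls: list[dict], pattern: str = "uipath-planner") -> bool:
    """True if any Skill tool call matches the pattern (assumed call-log shape)."""
    return any(
        call.get("tool_name") == "Skill" and re.search(pattern, call.get("input", ""))
        for call in tool_calls
    )


def classify(tool_calls: list[dict], sdd_checks_passed: bool) -> str:
    """Report an activation miss before looking at downstream SDD checks."""
    if not skill_activated(tool_calls):
        return "activation_failure"
    return "pass" if sdd_checks_passed else "sdd_check_failure"


free_formed = [{"tool_name": "Write", "input": "employee-onboarding-data-entry-sdd.md"}]
with_skill = [{"tool_name": "Skill", "input": "uipath-planner"}] + free_formed

print(classify(free_formed, sdd_checks_passed=False))  # activation_failure
print(classify(with_skill, sdd_checks_passed=True))    # pass
```

Ordering the activation check first means a free-formed SDD is reported as a routing problem, not as 18 missing template sections.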