fix(test): coherent timeouts on all planner tasks + close SDD filename prompt-gaps #1636
RaduAna-Maria wants to merge 6 commits
…e prompt-gaps
The planner SDD-authoring tasks set turn_timeout: 1800 but inherited the experiment default task_timeout: 1200, so the whole-task wall clock fired at ~1200s before the SDD landed (pdd_to_sdd timed out at 1204s). The per-turn cap also sat above the task cap, which is incoherent. Set task_timeout: 2700 on every planner task that raises turn_timeout so the task cap is always >= the turn cap.
Neither pdd_to_sdd nor e2e_rpa_sdd stated the required kebab-case output filename, yet their criteria check exact paths -- the agent named the file SDD_EmployeeOnboardingDataEntry.md and all 6 criteria failed. State the filename in the prompt (planner Critical Rule 6 honors a user-specified output path).
Also remove the stale, ignored max_iterations field on e2e_rpa_sdd that coder-eval plan flagged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
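
The coherence rule in this commit message (task cap >= turn cap, with the task cap falling back to the experiment default when unset) can be sketched as a small check. This is a minimal illustration, not part of coder-eval: the function names and the dict representation of a task are assumptions, and only the numbers 1200, 1800, and 2700 come from the commit message.

```python
# Assumed experiment default, taken from the commit message above.
DEFAULT_TASK_TIMEOUT = 1200


def effective_task_timeout(task: dict, default: int = DEFAULT_TASK_TIMEOUT) -> int:
    """Return the task cap, falling back to the experiment default when unset."""
    return task.get("task_timeout", default)


def is_coherent(task: dict, default: int = DEFAULT_TASK_TIMEOUT) -> bool:
    """A config is coherent when the task cap is >= the turn cap (if one is set)."""
    turn = task.get("turn_timeout")
    if turn is None:
        return True  # no per-turn override, nothing to compare
    return effective_task_timeout(task, default) >= turn


# Hypothetical in-memory stand-ins for the task YAMLs.
before = {"turn_timeout": 1800}                       # inherits task_timeout 1200
after = {"turn_timeout": 1800, "task_timeout": 2700}  # the fix in this commit

print(is_coherent(before))  # False
print(is_coherent(after))   # True
```

A check like this could run over every planner task file so that a raised turn cap without a matching task cap is caught before an eval run rather than after a timeout.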
Claude finished @RaduAna-Maria's task in 2m 18s
PR Review: fix(test): coherent timeouts on all planner tasks + close SDD filename prompt-gaps
Summary
Fixes an incoherent timeout config (
Change-by-Change Review
| Area | Status | Notes |
|---|---|---|
| Frontmatter | OK | No frontmatter changes; all task_ids and tags unchanged |
| E2E Tests | OK | These are the test fixes; no new skill code changed |
| Skill Body | N/A | No skill files modified |
| References & Assets | N/A | No reference files modified |
| Repo Hygiene | OK | Changes scoped to tests/tasks/uipath-planner/ only; no secrets or stray files |
Issues for Manual Review
- The PR description notes a behavioral run is pending via the Run Coder Eval workflow for `pdd_to_sdd` and `e2e_rpa_sdd`. Confirm those pass before merging: the timeout and filename fixes are sound on paper, but the SDD authoring turn is long enough that other failure modes (model behavior, skill changes) could surface.
Conclusion
The core fixes are correct and well-scoped. The timeout incoherence is real (task_timeout: 1200 < turn_timeout: 1800), the filename gap caused a real failure, and the max_iterations removal is clean. The one gap is that the filename prompt-fix was applied to 2 of 5 affected SDD tasks — the other 3 have the same latent risk. I'd suggest adding the filename to those prompts in this PR while the pattern is fresh (one-liner each), but it's not blocking if those tasks haven't failed yet.
Verdict: Approve with the filename-gap suggestion as a non-blocking follow-up.
Claude finished @RaduAna-Maria's task in 3m 47s
Coder-eval task lint (advisory)
7 task YAMLs changed; verdicts: 0 Critical, 0 High, 0 Medium, 1 Low, 6 OK. Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.
Evidence of passing run
❌ High: PR body does not claim the changed tasks have been run and passed. The body says
Per-task lint
Run 28012844870 proved the filename + skill-routing fixes work: the agent loaded uipath-planner, wrote employee-onboarding-data-entry-sdd.md, and produced a complete 730-line SDD (all 18 sections + Planner Handoff). But the task still scored 0 -- the comprehensive Master-Project SDD authoring runs ~100k output tokens in one autonomous turn and hit turn_timeout: 1800s mid-wrap-up. agent_timeout is a hard ERROR, so the success criteria never ran against the finished file.
The sibling pdd_to_sdd (single-product Invoice SDD) succeeded in 1136s, so 1800s is only tight for this one Master-Project task.
Raise turn_timeout to 2700 (file was already complete at the cap; the turn just needs to return) and task_timeout to 3600 so criteria checking + container overhead fit above the turn budget.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
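
The reason a finished file can still score 0 is easier to see in a toy model of the scoring order. The sketch below is purely illustrative and assumes a harness that checks the turn timeout before evaluating criteria; `run_outcome`, its return shape, and the fraction-of-criteria score are hypothetical, not coder-eval's actual implementation. The six passing criteria are also an assumption about what the finished file would have earned.

```python
from typing import Sequence, Tuple


def run_outcome(
    turn_elapsed_s: float,
    turn_timeout_s: float,
    criteria_pass: Sequence[bool],
) -> Tuple[str, float]:
    """Hypothetical scoring order: a turn timeout is a hard error, criteria are skipped."""
    if turn_elapsed_s >= turn_timeout_s:
        return ("ERROR", 0.0)  # criteria never run, even if the file is complete
    if not criteria_pass:
        return ("OK", 0.0)
    return ("OK", sum(criteria_pass) / len(criteria_pass))


# Assumption: the finished SDD would have satisfied all 6 criteria.
finished_file = [True] * 6

print(run_outcome(1800, 1800, finished_file))  # ('ERROR', 0.0)
print(run_outcome(1800, 2700, finished_file))  # ('OK', 1.0)
```

Under this model, raising the turn cap changes the outcome without any change to the agent's output, which matches the reasoning given for moving to 2700s.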
…vation assertion
The e2e re-run kept failing — but not on timeout. The agent free-formed a
1167-line SDD in 3 turns with tools={Write:1}, never loading the planner skill,
so check_sdd.py matched none of the template's 18 sections (run 28012844870).
When the skill DOES load it produces a correct SDD; activation was simply
non-deterministic for an inline PDD.
Fix (no force-load; fix activation for real):
- Stage the PDD as `pdd.md` via sandbox.template_sources. A real PDD file is the
strongest activation signal ("Always invoke for pdd.md files"), far more
reliable than an inline PDD.
- Natural prompt that references pdd.md (no "load the skill" hand-holding).
- Add a command_executed assertion (tool_name: Skill, pattern: uipath-planner)
so an activation miss fails crisply as an activation failure rather than as
downstream check_sdd.py noise.
Keeps the turn_timeout:2700 / task_timeout:3600 budget for when the skill loads
and runs the full Master-Project authoring turn.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
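
The activation assertion described in this commit can be pictured as a scan over the recorded tool calls for a `Skill` call whose input matches `uipath-planner`. The following is a rough sketch under assumptions: the tool-call record shape (`tool_name`, `input`) and the helper name `skill_activated` are invented for illustration, and the real `command_executed` criterion may work differently.

```python
import re
from typing import Iterable, Mapping


def skill_activated(
    tool_calls: Iterable[Mapping[str, str]],
    tool_name: str = "Skill",
    pattern: str = "uipath-planner",
) -> bool:
    """Return True if any recorded call used `tool_name` with input matching `pattern`."""
    return any(
        call.get("tool_name") == tool_name
        and re.search(pattern, call.get("input", "")) is not None
        for call in tool_calls
    )


# Hypothetical trace resembling the failing run: a single Write, no Skill call.
free_formed = [{"tool_name": "Write", "input": "employee-onboarding-data-entry-sdd.md"}]

# Hypothetical trace where the planner skill loaded before the file was written.
with_skill = [
    {"tool_name": "Skill", "input": "uipath-planner"},
    {"tool_name": "Write", "input": "employee-onboarding-data-entry-sdd.md"},
]

print(skill_activated(free_formed))  # False
print(skill_activated(with_skill))   # True
```

Checking activation separately means a miss shows up as one clearly named failure instead of a set of unmatched section checks further down.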
Why
Two unrelated failures in the `uipath-planner` SDD tasks, plus a latent config issue across the whole planner suite.
1. `pdd_to_sdd` timed out at 1204s (TIMEOUT, score 0)
The task raised `turn_timeout: 1800` but never set `task_timeout`, so it inherited the experiment default `task_timeout: 1200`. The whole-task wall clock fired before the SDD landed. `gap_detection` and `product_selection` had the same bug; every planner task that sets `turn_timeout: 1800` was also leaving the task cap below the turn cap, which is an incoherent config (a turn can't outlive its task).
Fix: set `task_timeout: 2700` on every planner task that raises `turn_timeout`, so the task cap is always ≥ the turn cap. `e2e_rpa_sdd` and `constraint_gate` already had both.
2. `e2e_rpa_sdd` failed all 6 criteria on a filename mismatch
The agent wrote `SDD_EmployeeOnboardingDataEntry.md`, but the criteria check the exact kebab-case path `employee-onboarding-data-entry-sdd.md`. The prompt never stated the required output filename. `pdd_to_sdd` had the same gap.
Fix: state the kebab-case output filename in both prompts. This is skill-compatible: planner Critical Rule 6 honors a user-specified output path.
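
To make the mismatch concrete, here is a small sketch that compares the two filenames from this section against a kebab-case pattern. The `-sdd.md` suffix rule is an assumption inferred from the single expected path above, and `is_kebab_sdd_name` is a hypothetical helper, not something the criteria or the planner skill define.

```python
import re

# Assumed convention: lowercase words joined by hyphens, ending in "-sdd.md".
KEBAB_SDD = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*-sdd\.md$")


def is_kebab_sdd_name(filename: str) -> bool:
    """Return True if `filename` follows the assumed kebab-case SDD naming."""
    return KEBAB_SDD.match(filename) is not None


expected = "employee-onboarding-data-entry-sdd.md"  # path the criteria check
actual = "SDD_EmployeeOnboardingDataEntry.md"       # what the agent wrote

print(is_kebab_sdd_name(expected))  # True
print(is_kebab_sdd_name(actual))    # False
print(actual == expected)           # False
```

Because the criteria compare exact paths, a file with correct content but the second name fails every check, which is why the prompt now states the filename.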
3. Cleanup
Removed the stale `max_iterations: 1` top-level field on `e2e_rpa_sdd` that `coder-eval plan` flagged as unknown/ignored (default is already 1 iteration).
Files changed (7)
- `pdd_to_sdd.yaml`: add `task_timeout: 2700`; state output filename
- `e2e_rpa_sdd.yaml`: state output filename; drop stale `max_iterations`
- `gap_detection.yaml`, `product_selection.yaml`: add `task_timeout: 2700`
- `smoke_file_reading.yaml`, `smoke_pdd_activation.yaml`, `smoke_skill_activation.yaml`: add `task_timeout: 2700` (coherence with the 1800s turn cap)
Validation
- `coder-eval plan` on all 9 planner tasks → All tasks are valid (this also runs as the advisory `validate-task-schema` check here).
- Behavioral run pending via the Run Coder Eval workflow (`workflow_dispatch`, `task_globs: tasks/uipath-planner/pdd_to_sdd.yaml tasks/uipath-planner/e2e_rpa_sdd/e2e_rpa_sdd.yaml`); the auto smoke check only runs `--tags smoke`, so it does not cover these two SDD tasks.
🤖 Generated with Claude Code