fix: mark losing-branch nodes as cancelled after df.race completes #178

Draft
Copilot wants to merge 3 commits into main from copilot/fix-instance-nodes-status

Copilot AI commented May 27, 2026

After a df.race() completes, the losing branch's sub-orchestration is cancelled by duroxide internally, but df.nodes rows were never updated — leaving them as running or pending indefinitely. Dashboards and df.instance_nodes queries would show ghost in-flight work for already-completed instances.

Schema

nodes_status_chk on df.nodes is widened to include 'cancelled'. Migration in sql/pg_durable--0.2.2--0.2.3.sql (constraint added NOT VALID — no table scan on upgrade). Version bumped to 0.2.3.

New activity: cancel_subtree_nodes

Bulk UPDATE df.nodes SET status = 'cancelled' WHERE id = ANY($1) AND status NOT IN ('completed', 'failed', 'cancelled'). The terminal-state guard ensures nodes that happened to complete before select2() returned are not disturbed.
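
The effect of the terminal-state guard can be modelled in memory. The following is a minimal sketch, not the activity's actual code: the `Status` enum, the `u64` node IDs, and the `cancel_non_terminal` helper are all hypothetical stand-ins for the `df.nodes` rows and the `UPDATE` statement above.

```rust
use std::collections::HashMap;

/// Node statuses named in this PR (hypothetical enum; the real column is text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Running,
    Completed,
    Failed,
    Cancelled,
}

impl Status {
    /// The three statuses excluded by the UPDATE's NOT IN clause.
    fn is_terminal(self) -> bool {
        matches!(self, Status::Completed | Status::Failed | Status::Cancelled)
    }
}

/// In-memory stand-in for the bulk UPDATE: cancels every listed node that is
/// not already terminal and returns how many entries changed.
/// Assumption: node IDs are modelled as u64 for illustration only.
fn cancel_non_terminal(nodes: &mut HashMap<u64, Status>, ids: &[u64]) -> usize {
    let mut changed = 0;
    for id in ids {
        // Unknown IDs are ignored, just as the UPDATE matches no row for them.
        if let Some(status) = nodes.get_mut(id) {
            if !status.is_terminal() {
                *status = Status::Cancelled;
                changed += 1;
            }
        }
    }
    changed
}
```

In this model, calling the helper a second time with the same IDs changes nothing, which mirrors why including 'cancelled' in the guard makes the update safe to repeat.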

Orchestration fix: execute_race_node

After ctx.select2() returns the winner, the losing branch root is known. collect_subtree_node_ids() walks the losing subtree depth-first, collecting every node reachable from that root (following left_node, right_node, condition_node, and extra_nodes), and execute_race_node then schedules cancel_subtree_nodes for the collected IDs. Cancellation is best-effort: a failure logs a warning but does not fail the workflow.
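
The subtree walk described above might look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: the `Node` struct, the `u64` IDs, the in-memory `HashMap` graph, and the choice to include the root in the result are all hypothetical. Only the four link names come from the description.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical node shape: only the four link fields named in the PR are modelled.
#[derive(Debug, Default)]
struct Node {
    left_node: Option<u64>,
    right_node: Option<u64>,
    condition_node: Option<u64>,
    extra_nodes: Vec<u64>,
}

/// Iterative depth-first walk from `root`, returning every reachable node ID
/// (including the root, by assumption) in visit order.
fn collect_subtree_node_ids(graph: &HashMap<u64, Node>, root: u64) -> Vec<u64> {
    let mut seen = HashSet::new();
    let mut order = Vec::new();
    let mut stack = vec![root];
    while let Some(id) = stack.pop() {
        // Skip IDs already visited so shared or cyclic links cannot loop forever.
        if !seen.insert(id) {
            continue;
        }
        order.push(id);
        if let Some(node) = graph.get(&id) {
            // Follow all four kinds of outgoing link.
            let children = node
                .left_node
                .iter()
                .chain(node.right_node.iter())
                .chain(node.condition_node.iter())
                .chain(node.extra_nodes.iter());
            stack.extend(children.copied());
        }
    }
    order
}
```

The visited set is a defensive choice in this sketch: a node reachable through two different links is returned once, so the ID list handed to the cancel step contains no duplicates.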

-- Before: df.instance_nodes after race completes
 node_type | status
-----------+---------
 SQL       | completed   ← winner
 SLEEP     | running     ← ghost

-- After
 node_type | status
-----------+-----------
 SQL       | completed
 SLEEP     | cancelled

Tests

tests/e2e/sql/24_race_loser_cancelled.sql covers two scenarios: a single-node losing branch (SLEEP) and a multi-node losing branch (THEN + SQL sequence). Both use a poll loop rather than a fixed sleep to wait for the cancel activity to settle.
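
The poll-instead-of-sleep pattern used by the test can be sketched generically. The actual test is written in SQL; the Rust helper below is a hypothetical translation of the same idea, with `poll_until` and its parameters invented for illustration.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Calls `check` up to `max_attempts` times, sleeping `interval` between
/// attempts, and returns the 1-based attempt on which it first returned true.
/// Returns None if the condition never held.
fn poll_until<F>(max_attempts: u32, interval: Duration, mut check: F) -> Option<u32>
where
    F: FnMut() -> bool,
{
    for attempt in 1..=max_attempts {
        if check() {
            return Some(attempt);
        }
        // No point sleeping after the final failed attempt.
        if attempt < max_attempts {
            sleep(interval);
        }
    }
    None
}
```

With 50 attempts and a 200 ms interval, such a loop returns as soon as the condition holds and gives up after roughly ten seconds, whereas a fixed sleep always pays its full delay and still fails if the cancel activity settles later than expected.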

Copilot AI and others added 2 commits May 27, 2026 14:21
After a df.race() workflow completes, the losing branch's nodes in
df.instance_nodes were left in 'running' or 'pending' status, causing
ghost in-flight work to appear in diagnostics and dashboards.

Changes:
- Add 'cancelled' to nodes_status_chk constraint (src/lib.rs)
- Create migration sql/pg_durable--0.2.2--0.2.3.sql to widen constraint
- Bump version to 0.2.3 in Cargo.toml
- Add cancel_subtree_nodes activity (bulk-cancels non-terminal nodes)
- Register new activity in activities/mod.rs and registry.rs
- Add collect_subtree_node_ids() helper (DFS traversal of node graph)
- Modify execute_race_node() to cancel losing-branch nodes after winner
- Add e2e test 24_race_loser_cancelled.sql
- Update CHANGELOG.md and docs/upgrade-testing.md

Co-authored-by: pinodeca <32303022+pinodeca@users.noreply.github.com>
- Update comment in cancel_subtree_nodes.rs to list all three terminal
  statuses ('completed', 'failed', 'cancelled') that are guarded
- Add trace logging when cancel-subtree-nodes activity fails (best-effort,
  does not fail the workflow) instead of silently discarding the error
- Replace hardcoded pg_sleep(2) in e2e test with poll loop (50 x 200ms)
  to avoid flakiness while still detecting the settled state quickly

Co-authored-by: pinodeca <32303022+pinodeca@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix instance nodes showing race-loser nodes as running or pending" to "fix: mark losing-branch nodes as cancelled after df.race completes" on May 27, 2026
Copilot AI requested a review from pinodeca May 27, 2026 14:28

Development

Successfully merging this pull request may close these issues.

df.instance_nodes leaves race-loser nodes running or pending after race completion