fix: mark losing-branch nodes as cancelled after df.race completes #178
Draft
Copilot wants to merge 3 commits into
After a df.race() workflow completes, the losing branch's nodes in df.instance_nodes were left in 'running' or 'pending' status, causing ghost in-flight work to appear in diagnostics and dashboards.

Changes:
- Add 'cancelled' to nodes_status_chk constraint (src/lib.rs)
- Create migration sql/pg_durable--0.2.2--0.2.3.sql to widen constraint
- Bump version to 0.2.3 in Cargo.toml
- Add cancel_subtree_nodes activity (bulk-cancels non-terminal nodes)
- Register new activity in activities/mod.rs and registry.rs
- Add collect_subtree_node_ids() helper (DFS traversal of node graph)
- Modify execute_race_node() to cancel losing-branch nodes after winner
- Add e2e test 24_race_loser_cancelled.sql
- Update CHANGELOG.md and docs/upgrade-testing.md

Co-authored-by: pinodeca <32303022+pinodeca@users.noreply.github.com>
- Update comment in cancel_subtree_nodes.rs to list all three terminal
statuses ('completed', 'failed', 'cancelled') that are guarded
- Add trace logging when cancel-subtree-nodes activity fails (best-effort,
does not fail the workflow) instead of silently discarding the error
- Replace hardcoded pg_sleep(2) in e2e test with poll loop (50 x 200ms)
to avoid flakiness while still detecting the settled state quickly
Co-authored-by: pinodeca <32303022+pinodeca@users.noreply.github.com>
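
The bounded poll loop mentioned in the last bullet (50 attempts, 200 ms apart, so at most about 10 seconds of waiting) follows a common pattern. The e2e test itself is a SQL file and its code is not shown here, so the following is only a rough Rust sketch of the pattern; `poll_until` and its parameters are hypothetical names chosen for illustration.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: calls `check` up to `max_attempts` times, sleeping
/// `interval` between attempts. Returns the 1-based attempt on which the
/// check first passed, or None if it never passed.
fn poll_until<F: FnMut() -> bool>(
    mut check: F,
    max_attempts: u32,
    interval: Duration,
) -> Option<u32> {
    for attempt in 1..=max_attempts {
        if check() {
            // Settled state observed: stop early instead of waiting out a fixed sleep.
            return Some(attempt);
        }
        if attempt < max_attempts {
            sleep(interval);
        }
    }
    None
}
```

Compared with a fixed sleep, the loop returns as soon as the condition holds and only pays the full timeout when the expected state never appears.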
Copilot AI changed the title from "[WIP] Fix instance nodes showing race-loser nodes as running or pending" to "fix: mark losing-branch nodes as cancelled after df.race completes" on May 27, 2026
After a df.race() completes, the losing branch's sub-orchestration is cancelled by duroxide internally, but df.nodes rows were never updated, leaving them as running or pending indefinitely. Dashboards and df.instance_nodes queries would show ghost in-flight work for already-completed instances.

Schema

nodes_status_chk on df.nodes is widened to include 'cancelled'. Migration in sql/pg_durable--0.2.2--0.2.3.sql (constraint added NOT VALID, so no table scan on upgrade). Version bumped to 0.2.3.

New activity: cancel_subtree_nodes

Bulk UPDATE df.nodes SET status = 'cancelled' WHERE id = ANY($1) AND status NOT IN ('completed', 'failed', 'cancelled'). The terminal-state guard ensures nodes that happened to complete before select2() returned are not disturbed.

Orchestration fix: execute_race_node

After ctx.select2() returns the winner, the losing branch root is known. A collect_subtree_node_ids() DFS walk collects every node reachable from that root (following left_node, right_node, condition_node, and extra_nodes), then schedules cancel_subtree_nodes. Cancellation is best-effort: a failure logs a warning but does not fail the workflow.

Tests

tests/e2e/sql/24_race_loser_cancelled.sql covers two scenarios: a single-node losing branch (SLEEP) and a multi-node losing branch (THEN + SQL sequence). Both use a poll loop rather than a fixed sleep to wait for the cancel activity to settle.
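
To make the orchestration fix and the terminal-state guard concrete, here is a minimal in-memory sketch of the two steps: walk the losing branch, then cancel only non-terminal nodes. It is an illustration under assumptions, not the PR's code. The `Node` struct, the `u64` ids, and the `HashMap` standing in for df.nodes are hypothetical, and the real activity issues the SQL UPDATE shown above rather than mutating a map. Only the field names (left_node, right_node, condition_node, extra_nodes) and the status strings come from the description.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory stand-in for one row of df.nodes.
#[derive(Debug, Clone, Default)]
struct Node {
    status: String,
    left_node: Option<u64>,
    right_node: Option<u64>,
    condition_node: Option<u64>,
    extra_nodes: Vec<u64>,
}

/// Iterative DFS from `root`, following every child link.
/// Returns the ids of all nodes reachable from `root` that exist in `nodes`.
fn collect_subtree_node_ids(nodes: &HashMap<u64, Node>, root: u64) -> Vec<u64> {
    let mut seen = HashSet::new();
    let mut stack = vec![root];
    let mut out = Vec::new();
    while let Some(id) = stack.pop() {
        // Skip ids already visited (protects against shared children or cycles).
        if !seen.insert(id) {
            continue;
        }
        // Ignore dangling references to nodes that are not in the map.
        let Some(node) = nodes.get(&id) else { continue };
        out.push(id);
        stack.extend(node.left_node);
        stack.extend(node.right_node);
        stack.extend(node.condition_node);
        stack.extend(node.extra_nodes.iter().copied());
    }
    out
}

/// Mirrors the guarded UPDATE: only nodes that are not already
/// 'completed', 'failed', or 'cancelled' are changed.
/// Returns how many nodes were set to 'cancelled'.
fn cancel_subtree_nodes(nodes: &mut HashMap<u64, Node>, ids: &[u64]) -> usize {
    let mut changed = 0;
    for id in ids {
        if let Some(node) = nodes.get_mut(id) {
            let terminal = matches!(
                node.status.as_str(),
                "completed" | "failed" | "cancelled"
            );
            if !terminal {
                node.status = "cancelled".to_string();
                changed += 1;
            }
        }
    }
    changed
}
```

Because of the guard, running the cancel step a second time over the same ids changes nothing, and a node in the losing branch that already finished keeps its 'completed' status.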