fix(menubar): run CLI exit-wait and timeout off the cooperative pool #426

Merged: iamtoruk merged 1 commit into main from fix/menubar-dataclient-deadlock on Jun 2, 2026

iamtoruk (Member) commented Jun 2, 2026

Summary

The menubar wedged on "Loading Today…" for hours after an idle period. This moves the two blocking points in DataClient.runCLI off Swift's cooperative thread pool so a saturated pool can no longer deadlock the CLI calls or their timeout.

Root cause

DataClient.runCLI called the blocking process.waitUntilExit() from an async function, so each call occupied a cooperative-pool thread until its subprocess exited. On a 16-core machine, 16 concurrent slow codeburn subprocesses pinned all 16 cooperative threads inside waitUntilExit. The 45s timeout was itself a Task on that same pool, so it could never be scheduled to kill them, and the deadlock was permanent. Confirmed via sample: 16/16 cooperative threads parked in waitUntilExit.

PR #412 (AppStore inFlightKeys bookkeeping) sat a layer above this OS-thread deadlock and could not fix it.
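
For illustration, the failure mode described above has roughly the following shape. This is a minimal sketch rather than the actual DataClient code: the function names `runCLIBlocking` and `runWithTaskTimeout` are hypothetical, and it assumes Swift 5 language mode (under Swift 6 strict concurrency, capturing a `Process` in a `Task` closure would need extra care).

```swift
import Foundation

// Anti-pattern sketch: a blocking wait inside an async function.
// (Hypothetical name; not the real DataClient.runCLI.)
func runCLIBlocking(_ process: Process) async throws -> Int32 {
    try process.run()
    // Parks the current cooperative-pool thread until the child exits.
    process.waitUntilExit()
    return process.terminationStatus
}

// The timeout is itself a Task, so after sleeping it needs a free
// cooperative thread before it can run and terminate the process.
func runWithTaskTimeout(_ process: Process, seconds: UInt64) async throws -> Int32 {
    let killer = Task {
        try await Task.sleep(nanoseconds: seconds * 1_000_000_000)
        process.terminate()
    }
    defer { killer.cancel() }
    return try await runCLIBlocking(process)
}
```

With a single call this behaves normally. The problem only appears once as many calls as there are cooperative threads are blocked in `waitUntilExit` at the same time, because no thread is then left to resume `killer`.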

Fix

  • Bridge waitUntilExit through a global (overcommit) queue via a continuation.
  • Drive the timeout from a DispatchSource on a global queue so it fires even when the cooperative pool is saturated.
  • Extract runProcess for testability.
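
The first two bullets can be sketched as follows. This is an assumption-laden illustration, not the merged code: `ProcessRunError` and `TimeoutFlag` are hypothetical helpers, the signature of `runProcess` is guessed, and the queue choice of `global(qos: .utility)` for both the wait and the timer is taken from the follow-up commit message quoted later on this page, which also reports that sharing that one queue could still wedge under sustained load. Written for Swift 5 language mode.

```swift
import Foundation

/// Hypothetical error type for this sketch.
enum ProcessRunError: Error { case timedOut }

/// Lock-protected flag recording whether the timer killed the process.
final class TimeoutFlag: @unchecked Sendable {
    private let lock = NSLock()
    private var fired = false
    func set() { lock.lock(); fired = true; lock.unlock() }
    var isSet: Bool { lock.lock(); defer { lock.unlock() }; return fired }
}

/// Starts `process` and waits for it without blocking a cooperative thread.
func runProcess(_ process: Process, timeout: TimeInterval) async throws -> Int32 {
    try process.run()

    let timedOut = TimeoutFlag()
    let queue = DispatchQueue.global(qos: .utility)

    // GCD timer: fires on a Dispatch worker thread, so it does not depend
    // on the cooperative pool having a free thread.
    let timer = DispatchSource.makeTimerSource(queue: queue)
    timer.schedule(deadline: .now() + timeout)
    timer.setEventHandler {
        if process.isRunning {
            timedOut.set()
            process.terminate()
        }
    }
    timer.resume()

    // The blocking wait runs on a Dispatch worker; the async caller only
    // suspends until the continuation is resumed.
    let status: Int32 = await withCheckedContinuation { continuation in
        queue.async {
            process.waitUntilExit()
            continuation.resume(returning: process.terminationStatus)
        }
    }
    timer.cancel()

    if timedOut.isSet { throw ProcessRunError.timedOut }
    return status
}
```

A caller would `try await runProcess(process, timeout: 45)` and treat `ProcessRunError.timedOut` as the signal that the CLI was killed.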

Testing

  • New DataClientProcessTests: concurrency + timeout smoke test, output/exit-code test.
  • Running clean on a local dev build through a multi-hour idle soak — the exact scenario that previously wedged.

@iamtoruk iamtoruk merged commit bec0491 into main Jun 2, 2026
3 checks passed
iamtoruk added a commit that referenced this pull request Jun 9, 2026
…read; cap CLI spawns

The #426 fix moved waitUntilExit and its timeout onto the same global(qos:.utility)
queue. Under sustained load every utility worker blocked in waitUntilExit, so the
timeout could never be scheduled to kill them and the menubar wedged on Loading
forever (confirmed via sample after ~a week of soak). Await process.terminationHandler
(fires on a Foundation queue, blocks no worker) so the timeout always has a free
thread. Add an actor-based async semaphore capping concurrent CLI spawns at 6.
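
The two techniques named in this commit message can be sketched like this. It is a hedged illustration only: `AsyncSemaphore`, `waitForExit(of:)`, and `runLimited(_:slots:)` are hypothetical names, the limit of 6 comes from the message above, and the timeout and task-cancellation handling are left out for brevity.

```swift
import Foundation

/// Hypothetical actor-based async semaphore: at most `limit` holders at once.
actor AsyncSemaphore {
    private var available: Int
    private var waiters: [CheckedContinuation<Void, Never>] = []

    init(limit: Int) { available = limit }

    func acquire() async {
        if available > 0 {
            available -= 1
            return
        }
        // No permit free: suspend (without blocking a thread) until release().
        await withCheckedContinuation { waiters.append($0) }
    }

    func release() {
        if waiters.isEmpty {
            available += 1
        } else {
            // Hand the permit directly to the longest-waiting caller.
            waiters.removeFirst().resume()
        }
    }
}

/// Starts `process` and suspends until Foundation calls its terminationHandler.
/// No thread is blocked while the child runs.
func waitForExit(of process: Process) async throws -> Int32 {
    try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<Int32, Error>) in
        // The handler must be installed before run().
        process.terminationHandler = { finished in
            continuation.resume(returning: finished.terminationStatus)
        }
        do {
            try process.run()
        } catch {
            // The process never started, so the handler will not fire.
            process.terminationHandler = nil
            continuation.resume(throwing: error)
        }
    }
}

/// Runs `process` while holding one of the semaphore's permits.
func runLimited(_ process: Process, slots: AsyncSemaphore) async throws -> Int32 {
    await slots.acquire()
    do {
        let status = try await waitForExit(of: process)
        await slots.release()
        return status
    } catch {
        await slots.release()
        throw error
    }
}
```

Creating one shared `AsyncSemaphore(limit: 6)` and passing it to every `runLimited` call would cap concurrent spawns at six. Callers beyond the cap suspend in `acquire()` rather than occupying a thread.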
iamtoruk added a commit that referenced this pull request Jun 9, 2026
…read; cap CLI spawns (#462)
