fix: bound iOS capture stalls and make runner recovery session-preserving (#1105)#1107
Merged
Conversation
…ving (#1105) Runner (Swift): - Coalesce duplicate transport sends of one commandId onto the in-flight execution instead of enqueueing them again behind it (capture pileup). - Fail fast with RUNNER_BUSY while watchdog-abandoned main-thread work is draining; escalate to RUNNER_WEDGED past 120s so the daemon recycles. - Carry the capture-plan deadline into the query-sweep and private-AX ladder tiers so chained recovery cannot stack past the watchdog. - Penalize the tree backend after a slow (>5s) or abandoned capture and lead subsequent regular plans with private-AX for that bundle (sticky, 120s), stamped recovered/budget so the deferral stays observable. Daemon (TS): - Per-request runner recycle budget: at most one invalidate+reboot per request, then fail fast with an actionable, session-preserving hint. - RUNNER_WEDGED joins the runner-fatal invalidation reasons. - Interaction commands (click/fill/longpress/press/type/get/is) preserve the daemon on request timeout like snapshot/wait/find: resetting it destroyed every healthy app session the daemon owned.
… capture XCTest records 'Failed to get matching snapshot: kAXErrorIllegalArgument' issues for every XCUIApplication query on AX-broken screens; after a few of them the test case tears down the moment the in-flight command completes, killing the long-lived runner after every capture of the screen (the restart loop behind #1105). The capture plan already classifies and recovers from AX failures, so this issue class is noise: swallow exactly it in record(_:); everything else still records and still drives XCTEST_RECORDED_FAILURE.
The tree snapshot XPC is a single blocking call whose duration moves with live content (4s to minutes on Bluesky profile screens); no in-process budget could bound it on the main thread. Run it on a worker bounded to an 8s slice: on timeout the plan penalizes the tree backend, skips the XCTest-backed tiers while the abandoned XPC drains (they would block behind it inside testmanagerd), and recovers through the private AX backend, which does not use testmanagerd.
The Bluesky profile tree grind measures ~4.5s before kAXErrorIllegalArgument, just under the old 5s threshold, so every capture re-paid the doomed grind (9s each). At 3s the second capture onward defers to private AX (2.4s snapshot, 4.9s press on the live repro).
Size Report
Startup median (7 runs, lower is better):
Top changed chunks:
|
- Require the kAXError token: 'Failed to get matching snapshot: Timed out while evaluating UI query.' is a genuinely-hung-query signal and must keep recording (and keep driving XCTEST_RECORDED_FAILURE). Sibling AX server codes (kAXErrorCannotComplete, ...) are deliberately included: any AX-server rejection inside a matching-snapshot fetch is the same capture-plan noise. - State honestly that the override is suite-global and why (tap-triggered queries record the same noise; command outcomes stay honest via their own error paths). - Lock-guarded suppressed-issue counter following the file's existing abandoned-work counter pattern, logged with each suppression. - Unit-test the pure classifier (record(_:) itself is not invoked: the must-record variants would record real failures in the test run).
Member
Author
|
Review defect addressed in 3fc481c:
Gate: format/typecheck/lint/layering green; Swift unit surface compiles (macOS CI-gate command). Full vitest runs on this machine currently show shifting spawn-timeout flakes from external CPU contention (five unrelated ~100%-CPU processes); every flagged file passes in isolation and none intersect this diff — CI remains the authoritative run. |
|
This was referenced Jul 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes the deepened form of #1105: on deep/dynamic screens (Bluesky profile pages with heavy embeds), iOS AX capture ground for 30-90s+, recovery paid for multiple full runner boots inside one request, the 90s client envelope then SIGKILLed the daemon on
press, and every app session died (SESSION_NOT_FOUNDuntil re-open).Root cause (three stacked failures, all verified live on the repro screen)
Failed to get matching snapshot: kAXErrorIllegalArgumentXCTest issues; after 3 accumulate, XCTest tears the test case down the instant the in-flight command completes. Verified on pristinemain:COMMAND_COMPLETED ok=1at 09:28:21.622 →Test Suite 'RunnerTests' failedat 09:28:21.624 (Executed 1 test, with 3 failures). Every subsequent command paid a fresh ~20s boot whose first capture died the same way — the restart loop, and why "a fresh runner also cannot capture this screen". The fix: add iOS private AX snapshot fallback #758 private-AX fallback itself is intact and fires correctly (recovered/private-ax, 133 nodes).kAXErrorIllegalArgument; now it grinds first — measured 4.5s on one attempt and >30s on another attempt of the same screen minutes apart (moves with live content). The single blockingsnapshot()XPC could not be bounded on the main thread; past the 30s watchdog the work is abandoned but keeps running, and the daemon transport re-sends the SAME commandId every ~2-20s (waitForRunnerconnect loop), each re-send accepted as a NEW execution queued behind the abandoned one.presshadonTimeout: 'reset-daemon', so the envelope timeout destroyed the daemon and all sessions.Fix
Runner (Swift)
RunnerTests.record(_:)swallows exactly theFailed to get matching snapshotissue class (capture-plan noise the plan already classifies/recovers); everything else still records and still drivesXCTEST_RECORDED_FAILURE. The runner now survives captures of AX-broken screens.RUNNER_BUSY(+ screenshot/coordinate hint); past 120s the runner reportsRUNNER_WEDGEDso the daemon recycles it (bounded below).effectiveSnapshotCapturePlan, iOS-sim only), stampedrecovered/budgetso the deferral stays observable. Raw diagnostic plans keep tree-first error propagation. (ADR 0004 Regression Notes updated.)Daemon (TS)
runner-recycle-ledger.ts: at most one runner recycle (invalidate + reboot) per request; the request's first cold boot stays free. On exhaustion, fail fast with an actionable hint ("app session is preserved… usescreenshot… navigate away") instead of paying another boot. Wired intoexecuteRunnerCommand(boot gate) andrestartSessionAndRunCommand.RUNNER_WEDGEDjoins the runner-fatal invalidation reasons (recycle-bounded).click,fill,longpress,press,type,get,is) now declarepreserve-daemonon timeout (fix: honor wait budgets, preserve the daemon on polling timeouts, refuse off-screen selector taps #1075 philosophy): their dominant hang is the same blocked capture assnapshot, and resetting the daemon destroyed every healthy session it owned. ADR-0011 text + the bounded-set gate test updated. No interaction-guarantee matrix cells changed — resolution semantics are untouched.Before / after on the live repro (iPhone 17 Pro sim, logged-in Bluesky dev build, alpenglow profile)
BEFORE (pristine main a055658):
The catastrophic nightly variant (92s press, two boots in one request, daemon SIGKILL, SESSION_NOT_FOUND) is the same wedge with worse grind timing; every ingredient was reproduced and removed individually (see root cause).
AFTER (this branch):
Every command ≤ ~18s (open included), zero runner restarts, zero session loss.
Additional live validations (iPhone 16 sim, modified runner):
RUNNER_BUSY+ actionable hint; after the abandoned work drains: healthy snapshot in 267ms, NO restart (AGENT_DEVICE_RUNNER_ABANDONED_WORK_DRAINED).pressrecovers in 6.2s with exactly one recycle; session preserved; next snapshot healthy in 358ms.NO_MATCH) leaves the session usable.Verification
pnpm format:check && pnpm typecheck && pnpm lint && node --experimental-strip-types scripts/layering/check.ts— all green.npx vitest run src test/integration(exit captured unpiped): 3388 tests; two runs each had 1-3 machine-contention flakes with disjoint failure sets (daemon-clientdownload-abort, simctl status-bar, provider doctor Metro probe — a real Metro was live on 8081 for the repro and could not be stopped); every flagged file passes in isolation and none intersect this diff.tap-point-policy-parity,src/contracts);contracts/fixtures/tap-point-policy.jsonuntouched; gesture semantics untouched.effectiveSnapshotCapturePlanand the penalty state; new TS unit tests for the recycle ledger and the fail-fast cap.Found but not fixed (follow-ups)
waitForRunnerstill re-POSTs the full read-only command as its connection probe; coalescing makes it benign, but a dedicated probe + long-poll would be cleaner.RUNNER_BUSYis surfaced to the caller; the daemon could transparently back off and retry read-only commands when the busy window is short.simctl installdoes not copy app data (fresh installs are signed out), and relaunched dev-client builds can land on the Expo dev-launcher; the repro script in this PR's transcripts navigated viabluesky://profile/…deep link.