fix: bound iOS capture stalls and make runner recovery session-preserving (#1105) #1107

Merged
thymikee merged 5 commits into main from fix/ios-capture-stall-recovery
Jul 5, 2026
thymikee commented Jul 5, 2026

Member

Fixes the deepened form of #1105: on deep/dynamic screens (Bluesky profile pages with heavy embeds), iOS AX capture ground for 30-90s+, recovery paid for multiple full runner boots inside one request, the 90s client envelope then SIGKILLed the daemon on press, and every app session died (SESSION_NOT_FOUND until re-open).

Root cause (three stacked failures, all verified live on the repro screen)

  1. The runner process died after every capture of an AX-broken screen. During capture, XCUIApplication queries record "Failed to get matching snapshot: kAXErrorIllegalArgument" XCTest issues; after 3 accumulate, XCTest tears the test case down the instant the in-flight command completes. Verified on pristine main: COMMAND_COMPLETED ok=1 at 09:28:21.622 → Test Suite 'RunnerTests' failed at 09:28:21.624 (Executed 1 test, with 3 failures). Every subsequent command paid a fresh ~20s boot whose first capture died the same way. This is the restart loop, and why "a fresh runner also cannot capture this screen". The private-AX fallback from #758 (fix: add iOS private AX snapshot fallback) itself is intact and fires correctly (recovered/private-ax, 133 nodes).
  2. The tree backend's failure shape changed from fail-fast to slow-grind, and retries piled up. In #758's era the tree failed instantly with kAXErrorIllegalArgument; now it grinds first: measured 4.5s on one attempt and >30s on another attempt of the same screen minutes apart (the duration moves with live content). The single blocking snapshot() XPC could not be bounded on the main thread; past the 30s watchdog the work is abandoned but keeps running, and the daemon transport re-sends the SAME commandId every ~2-20s (waitForRunner connect loop), each re-send accepted as a NEW execution queued behind the abandoned one.
  3. The daemon paid unbounded recovery and then killed itself. Read-only retry × restart-and-replay allowed 2+ runner recycles (~25s each) per request; press had onTimeout: 'reset-daemon', so the envelope timeout destroyed the daemon and all sessions.

Fix

Runner (Swift)

  • RunnerTests.record(_:) swallows exactly the Failed to get matching snapshot issue class (capture-plan noise the plan already classifies/recovers); everything else still records and still drives XCTEST_RECORDED_FAILURE. The runner now survives captures of AX-broken screens.
  • Time-sliced tree capture: the blocking tree-snapshot XPC runs on a worker thread bounded to an 8s slice (also capped by the 20s plan deadline). On timeout the plan penalizes the tree backend and recovers through private-AX; while the abandoned XPC drains, plans skip the XCTest-backed tiers (they'd block behind it inside testmanagerd) — private-AX does not use testmanagerd.
  • Duplicate-commandId coalescing: transport re-sends of a still-executing commandId attach as waiters of the in-flight execution instead of enqueueing a second execution.
  • Busy guard + wedge escalation: while watchdog-abandoned main-thread work drains, new main-thread commands fail fast with RUNNER_BUSY (+ screenshot/coordinate hint); past 120s the runner reports RUNNER_WEDGED so the daemon recycles it (bounded below).
  • Plan budget into the tiers: the query sweep and private-AX depth-ladder rungs now stop when the 20s plan deadline is spent.
  • Cross-attempt memory: a slow (>3s) or abandoned tree capture penalizes the tree backend for that bundle for 120s; subsequent regular plans lead with private-AX (effectiveSnapshotCapturePlan, iOS-sim only), stamped recovered/budget so the deferral stays observable. Raw diagnostic plans keep tree-first error propagation. (ADR 0004 Regression Notes updated.)
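The duplicate-commandId coalescing described above can be illustrated with a small sketch. It is written in TypeScript for brevity even though the runner is Swift, and every name in it (`InFlightCommands`, `submit`, `complete`) is hypothetical rather than taken from the runner source. The idea it shows is only this: a re-send of an id that is still executing attaches as a waiter instead of starting a second execution.

```typescript
// Hypothetical sketch; names and shapes are illustrative, not the runner's API.
type Waiter<R> = (result: R) => void;

class InFlightCommands<R> {
  // commandId -> callbacks waiting for the single in-flight execution
  private readonly waiters = new Map<string, Waiter<R>[]>();

  // "started": the caller must execute the command.
  // "attached": an execution with this id is already running; just wait.
  submit(commandId: string, waiter: Waiter<R>): "started" | "attached" {
    const existing = this.waiters.get(commandId);
    if (existing) {
      existing.push(waiter);
      return "attached";
    }
    this.waiters.set(commandId, [waiter]);
    return "started";
  }

  // Delivers the one result to every waiter, forgets the id, and returns
  // how many waiters were notified.
  complete(commandId: string, result: R): number {
    const list = this.waiters.get(commandId) ?? [];
    this.waiters.delete(commandId);
    for (const notify of list) notify(result);
    return list.length;
  }
}

// Usage: the transport sends cmd-1, then re-sends it while it is still running.
const inFlight = new InFlightCommands<string>();
const seen: string[] = [];
inFlight.submit("cmd-1", (r) => { seen.push("first:" + r); });  // "started"
inFlight.submit("cmd-1", (r) => { seen.push("resend:" + r); }); // "attached"
inFlight.complete("cmd-1", "ok"); // both callers receive the same result
```

Under this shape a connection probe that re-POSTs the full command costs one extra waiter rather than one extra capture queued behind the abandoned one.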

Daemon (TS)

  • New runner-recycle-ledger.ts: at most one runner recycle (invalidate + reboot) per request; the request's first cold boot stays free. On exhaustion, fail fast with an actionable hint ("app session is preserved… use screenshot… navigate away") instead of paying another boot. Wired into executeRunnerCommand (boot gate) and restartSessionAndRunCommand.
  • RUNNER_WEDGED joins the runner-fatal invalidation reasons (recycle-bounded).
  • Capture-resolving interaction commands (click, fill, longpress, press, type, get, is) now declare preserve-daemon on timeout (the #1075 philosophy; fix: honor wait budgets, preserve the daemon on polling timeouts, refuse off-screen selector taps): their dominant hang is the same blocked capture as snapshot, and resetting the daemon destroyed every healthy session it owned. ADR-0011 text + the bounded-set gate test updated. No interaction-guarantee matrix cells changed; resolution semantics are untouched.
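As a rough illustration of the per-request recycle budget, the sketch below models the rule stated above: the first cold boot is free, one recycle is allowed, and anything further fails fast. It is not the contents of runner-recycle-ledger.ts; the class name, method name, and hint wording are assumptions made for the example.

```typescript
// Hypothetical sketch of a per-request recycle budget. The real
// runner-recycle-ledger.ts API is not shown in this PR text.
type BootReason = "cold-boot" | "recycle";
type LedgerDecision =
  | { allowed: true }
  | { allowed: false; hint: string };

class RunnerRecycleLedger {
  private coldBootUsed = false;
  private recyclesUsed = 0;
  private readonly maxRecycles: number;

  constructor(maxRecycles: number = 1) {
    this.maxRecycles = maxRecycles;
  }

  // Ask before booting a runner. The request's first cold boot is free;
  // every other boot counts as an invalidate + reboot and spends the budget.
  requestBoot(reason: BootReason): LedgerDecision {
    if (reason === "cold-boot" && !this.coldBootUsed) {
      this.coldBootUsed = true;
      return { allowed: true };
    }
    if (this.recyclesUsed < this.maxRecycles) {
      this.recyclesUsed += 1;
      return { allowed: true };
    }
    // Illustrative wording only; the real hint text is elided in the PR.
    return { allowed: false, hint: "recycle budget spent; app session is preserved" };
  }
}

// Usage: one ledger per request.
const ledger = new RunnerRecycleLedger();
ledger.requestBoot("cold-boot"); // allowed, free
ledger.requestBoot("recycle");   // allowed, spends the single recycle
ledger.requestBoot("recycle");   // refused with a hint instead of another boot
```

Keeping the ledger scoped to a single request means a later request starts with a fresh budget, which matches the "per request" wording in the description.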

Before / after on the live repro (iPhone 17 Pro sim, logged-in Bluesky dev build, alpenglow profile)

BEFORE (pristine main a055658):

[09:24:37] STEP 09-snap-profile rc=0 wall_ms=8994    # recovered/private-ax; tree ground then kAXError; ladder depth 56
[09:25:13] STEP 10-press-notifications rc=1 wall_ms=35724   # XCTEST_RECORDED_FAILURE, "runner session will be restarted"
runner.log: test case tears down 2ms after the snapshot completes; 3 runner boots across the short flow

The catastrophic nightly variant (92s press, two boots in one request, daemon SIGKILL, SESSION_NOT_FOUND) is the same wedge with worse grind timing; every ingredient was reproduced and removed individually (see root cause).

AFTER (this branch):

[09:42:46] STEP 03-snap-profile rc=0 wall_ms=9144    # first contact pays one bounded grind; penalty set
[09:42:48] STEP 04-snap-profile2 rc=0 wall_ms=2419   # recovered/private-ax, tree deferred, 133 nodes
[09:42:53] STEP 05-press-notifications rc=0 wall_ms=4863   # SUCCESS
[09:42:54] STEP 06-snap-after rc=0 wall_ms=248       # usable tree, session intact
[09:42:54] STEP 07-snap-again rc=0 wall_ms=213

Every command ≤ ~18s (open included), zero runner restarts, zero session loss.

Additional live validations (iPhone 16 sim, modified runner):

  • Forced watchdog abandonment → concurrent snapshot fails in 219ms with RUNNER_BUSY + actionable hint; after the abandoned work drains: healthy snapshot in 267ms, NO restart (AGENT_DEVICE_RUNNER_ABANDONED_WORK_DRAINED).
  • SIGKILL the runner mid-session → next mutating press recovers in 6.2s with exactly one recycle; session preserved; next snapshot healthy in 358ms.
  • Failed press (NO_MATCH) leaves the session usable.
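The busy guard and wedge escalation exercised by the first validation can be pictured as a pure decision over how long abandoned work has been draining. The sketch below is a TypeScript illustration, not the Swift runner code; `guardMainThreadCommand` and its parameters are hypothetical, and only the RUNNER_BUSY and RUNNER_WEDGED codes and the 120s bound come from this PR.

```typescript
// Hypothetical sketch of the busy guard; only the two error codes and the
// 120s escalation bound are taken from the PR description.
type GuardDecision =
  | { kind: "accept" }
  | { kind: "reject"; code: "RUNNER_BUSY" | "RUNNER_WEDGED" };

const WEDGE_AFTER_MS = 120_000;

// abandonedSinceMs: when the watchdog abandoned the main-thread work that is
// still draining, or null when nothing is draining.
function guardMainThreadCommand(
  abandonedSinceMs: number | null,
  nowMs: number,
): GuardDecision {
  if (abandonedSinceMs === null) return { kind: "accept" };
  const drainingForMs = nowMs - abandonedSinceMs;
  return drainingForMs > WEDGE_AFTER_MS
    ? { kind: "reject", code: "RUNNER_WEDGED" } // daemon should recycle
    : { kind: "reject", code: "RUNNER_BUSY" };  // caller should retry later
}

// Usage with an injected clock (milliseconds):
guardMainThreadCommand(null, 5_000);      // accept: nothing is draining
guardMainThreadCommand(1_000, 5_000);     // reject with RUNNER_BUSY
guardMainThreadCommand(1_000, 130_000);   // reject with RUNNER_WEDGED
```

Passing the clock in as an argument keeps the decision deterministic and easy to unit-test, which is a choice of this sketch rather than something the PR states.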

Verification

  • pnpm format:check && pnpm typecheck && pnpm lint && node --experimental-strip-types scripts/layering/check.ts — all green.
  • npx vitest run src test/integration (exit captured unpiped): 3388 tests; two runs each had 1-3 machine-contention flakes with disjoint failure sets (daemon-client download-abort, simctl status-bar, provider doctor Metro probe — a real Metro was live on 8081 for the repro and could not be stopped); every flagged file passes in isolation and none intersect this diff.
  • Contract suites green (tap-point-policy-parity, src/contracts); contracts/fixtures/tap-point-policy.json untouched; gesture semantics untouched.
  • Swift runner compiled with the unit-test surface for macOS (the CI gate command) and iOS sim; new in-bundle unit tests for effectiveSnapshotCapturePlan and the penalty state; new TS unit tests for the recycle ledger and the fail-fast cap.
  • Not run: the react-navigation Maestro replay suite — Metro on 8081 was serving the Bluesky dev build for the live repro (hard rule: don't kill it) and the machine owner asked to limit simulator load. Flagging explicitly per the brief; it should run in CI/on a free machine before merge.

Found but not fixed (follow-ups)

  • The FIRST capture of a hostile screen still pays one bounded ~8-9s attempt before the penalty engages, and penalty memory is per-runner-process. The durable cure for Bluesky-class screens remains ADR 0004's host-side AX-service backend.
  • While the 120s penalty window is active, sibling screens of the same app also use private-AX (e.g. the Notifications screen returned 90-97 usable nodes via private-ax instead of a healthy tree). Honest verdicts make this visible; a per-screen key would be finer.
  • waitForRunner still re-POSTs the full read-only command as its connection probe; coalescing makes it benign, but a dedicated probe + long-poll would be cleaner.
  • RUNNER_BUSY is surfaced to the caller; the daemon could transparently back off and retry read-only commands when the busy window is short.
  • Repro tooling notes: simctl install does not copy app data (fresh installs are signed out), and relaunched dev-client builds can land on the Expo dev-launcher; the repro script in this PR's transcripts navigated via bluesky://profile/… deep link.

thymikee added 4 commits July 5, 2026 09:26
…ving (#1105)

Runner (Swift):
- Coalesce duplicate transport sends of one commandId onto the in-flight
  execution instead of enqueueing them again behind it (capture pileup).
- Fail fast with RUNNER_BUSY while watchdog-abandoned main-thread work is
  draining; escalate to RUNNER_WEDGED past 120s so the daemon recycles.
- Carry the capture-plan deadline into the query-sweep and private-AX
  ladder tiers so chained recovery cannot stack past the watchdog.
- Penalize the tree backend after a slow (>5s) or abandoned capture and
  lead subsequent regular plans with private-AX for that bundle (sticky,
  120s), stamped recovered/budget so the deferral stays observable.

Daemon (TS):
- Per-request runner recycle budget: at most one invalidate+reboot per
  request, then fail fast with an actionable, session-preserving hint.
- RUNNER_WEDGED joins the runner-fatal invalidation reasons.
- Interaction commands (click/fill/longpress/press/type/get/is) preserve
  the daemon on request timeout like snapshot/wait/find: resetting it
  destroyed every healthy app session the daemon owned.
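A minimal sketch of the timeout policy described in the last item follows. The command names and the two policy values come from this PR; the lookup function, the set name, and the assumption that every command outside the set keeps reset-daemon are illustrative only.

```typescript
// Hypothetical sketch; the daemon's real policy declaration is not shown here.
type TimeoutPolicy = "preserve-daemon" | "reset-daemon";

// Commands whose dominant hang is a blocked capture. Resetting the daemon on
// their timeout would destroy every healthy session it owns.
const PRESERVE_DAEMON_COMMANDS: ReadonlySet<string> = new Set([
  "snapshot", "wait", "find",
  "click", "fill", "longpress", "press", "type", "get", "is",
]);

// Assumption: commands outside the bounded set keep the reset-daemon policy.
function onTimeoutPolicy(command: string): TimeoutPolicy {
  return PRESERVE_DAEMON_COMMANDS.has(command) ? "preserve-daemon" : "reset-daemon";
}

// Usage:
onTimeoutPolicy("press");    // "preserve-daemon" after this change
onTimeoutPolicy("snapshot"); // "preserve-daemon", as before
```

Keeping the set explicit and small is what makes a bounded-set gate test possible: adding a command to it becomes a deliberate, reviewable change.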
… capture

XCTest records 'Failed to get matching snapshot: kAXErrorIllegalArgument'
issues for every XCUIApplication query on AX-broken screens; after a few
of them the test case tears down the moment the in-flight command
completes, killing the long-lived runner after every capture of the
screen (the restart loop behind #1105). The capture plan already
classifies and recovers from AX failures, so this issue class is noise:
swallow exactly it in record(_:); everything else still records and
still drives XCTEST_RECORDED_FAILURE.

The tree snapshot XPC is a single blocking call whose duration moves
with live content (4s to minutes on Bluesky profile screens); no
in-process budget could bound it on the main thread. Run it on a worker
bounded to an 8s slice: on timeout the plan penalizes the tree backend,
skips the XCTest-backed tiers while the abandoned XPC drains (they
would block behind it inside testmanagerd), and recovers through the
private AX backend, which does not use testmanagerd.
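The bounded-slice idea can be sketched as a race between the blocking work and a timer. This is a TypeScript illustration under stated assumptions, not the Swift implementation: `withSlice` and `sliceBudgetMs` are hypothetical names, and a promise stands in for the worker thread. As in the runner, losing the race does not cancel the work; it only stops the caller from waiting on it.

```typescript
// Hypothetical sketch of a time-sliced wait; the work itself is NOT cancelled
// when the slice expires, mirroring the "abandoned XPC drains" behaviour.
type SliceOutcome<T> =
  | { kind: "completed"; value: T }
  | { kind: "timed-out" };

function withSlice<T>(work: Promise<T>, sliceMs: number): Promise<SliceOutcome<T>> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve({ kind: "timed-out" }), sliceMs);
    work.then(
      (value) => { clearTimeout(timer); resolve({ kind: "completed", value }); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// The slice is 8s at most and never extends past the plan deadline.
function sliceBudgetMs(planDeadlineAtMs: number, nowMs: number, sliceMs: number = 8_000): number {
  return Math.max(0, Math.min(sliceMs, planDeadlineAtMs - nowMs));
}

// Usage: with a 20s plan deadline, a capture starting at t=0 gets the full 8s
// slice, while one starting at t=15s gets only the remaining 5s.
sliceBudgetMs(20_000, 0);      // 8000
sliceBudgetMs(20_000, 15_000); // 5000
```

On a "timed-out" outcome the caller would penalize the tree backend and move on to the private AX tier, as the commit message describes.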
The Bluesky profile tree grind measures ~4.5s before kAXErrorIllegalArgument,
just under the old 5s threshold, so every capture re-paid the doomed grind
(9s each). At 3s the second capture onward defers to private AX (2.4s
snapshot, 4.9s press on the live repro).
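The effect of moving the threshold from 5s to 3s can be checked with a small model of the penalty memory. The sketch below is hypothetical TypeScript (the real state lives in the Swift runner); `TreePenaltyMemory`, its method names, and the bundle id are made up for the example, while the 4.5s grind, the 3s and 5s thresholds, and the 120s window come from this PR.

```typescript
// Hypothetical sketch of per-bundle penalty memory for the tree backend.
type Backend = "tree" | "private-ax";

class TreePenaltyMemory {
  private readonly penalizedUntilMs = new Map<string, number>();
  private readonly slowThresholdMs: number;
  private readonly penaltyWindowMs: number;

  constructor(slowThresholdMs: number, penaltyWindowMs: number = 120_000) {
    this.slowThresholdMs = slowThresholdMs;
    this.penaltyWindowMs = penaltyWindowMs;
  }

  // A slow (strictly above threshold) or abandoned capture starts the window.
  recordTreeCapture(bundleId: string, durationMs: number, abandoned: boolean, nowMs: number): void {
    if (abandoned || durationMs > this.slowThresholdMs) {
      this.penalizedUntilMs.set(bundleId, nowMs + this.penaltyWindowMs);
    }
  }

  // While penalized, regular plans lead with private-AX; otherwise tree-first.
  planOrder(bundleId: string, nowMs: number): Backend[] {
    const until = this.penalizedUntilMs.get(bundleId);
    const penalized = until !== undefined && nowMs < until;
    return penalized ? ["private-ax", "tree"] : ["tree", "private-ax"];
  }
}

// The measured 4.5s grind, evaluated under both thresholds.
const bundle = "com.example.app"; // placeholder bundle id
const oldThreshold = new TreePenaltyMemory(5_000);
const newThreshold = new TreePenaltyMemory(3_000);
oldThreshold.recordTreeCapture(bundle, 4_500, false, 0);
newThreshold.recordTreeCapture(bundle, 4_500, false, 0);
oldThreshold.planOrder(bundle, 1_000); // tree first: the grind repeats
newThreshold.planOrder(bundle, 1_000); // private-ax first: the grind is skipped
```

In this model the 4.5s capture slips under a 5s threshold and never triggers the penalty, which is the behaviour the commit message reports for the old setting.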
github-actions Bot commented Jul 5, 2026


Size Report

| Metric | Base | Current | Diff |
| --- | --- | --- | --- |
| JS raw | 1.5 MB | 1.5 MB | +1.6 kB |
| JS gzip | 491.3 kB | 491.8 kB | +566 B |
| npm tarball | 590.3 kB | 597.4 kB | +7.0 kB |
| npm unpacked | 2.1 MB | 2.1 MB | +23.5 kB |

Startup median (7 runs, lower is better):

| Scenario | Base | Current | Diff |
| --- | --- | --- | --- |
| CLI --version | 26.6 ms | 27.3 ms | +0.6 ms |
| CLI --help | 51.6 ms | 51.6 ms | -0.1 ms |

Top changed chunks:

| Chunk | Raw diff | Gzip diff |
| --- | --- | --- |
| dist/src/runner-client.js | +1.6 kB | +557 B |
| dist/src/batch-policy.js | 0 B | +9 B |

- Require the kAXError token: 'Failed to get matching snapshot: Timed out
  while evaluating UI query.' is a genuinely-hung-query signal and must
  keep recording (and keep driving XCTEST_RECORDED_FAILURE). Sibling AX
  server codes (kAXErrorCannotComplete, ...) are deliberately included:
  any AX-server rejection inside a matching-snapshot fetch is the same
  capture-plan noise.
- State honestly that the override is suite-global and why (tap-triggered
  queries record the same noise; command outcomes stay honest via their
  own error paths).
- Lock-guarded suppressed-issue counter following the file's existing
  abandoned-work counter pattern, logged with each suppression.
- Unit-test the pure classifier (record(_:) itself is not invoked: the
  must-record variants would record real failures in the test run).
thymikee commented Jul 5, 2026

Member Author

Review defect addressed in 3fc481c:

  • Narrowed swallow condition: now requires both Failed to get matching snapshot AND the kAXError token. The hung-query variant ("Timed out while evaluating UI query.") keeps recording and keeps driving XCTEST_RECORDED_FAILURE. Sibling AX server codes (kAXErrorCannotComplete etc.) are deliberately in the swallow class — any AX-server rejection inside a matching-snapshot fetch is the same capture-plan noise — and the comment says so.
  • Honest scope comment: the override is documented as suite-global (all commands), with the rationale (tap-triggered queries record the same noise and would still tear the runner down; command outcomes stay honest via their own error paths — only the issue side-channel is muted).
  • Observability: lock-guarded suppressedAxSnapshotIssueCount following the existing abandoned-work counter pattern; each suppression logs the running count. (Not surfaced in the status payload — noted as the optional bonus.)
  • Unit test: testSuppressedAxSnapshotIssueClassifier on the gated unit surface pins kAXError→suppressed, kAXErrorCannotComplete→suppressed, timeout-variant→recorded, unrelated→recorded, kAXError-outside-fetch→recorded. It tests the pure classifier rather than calling record(_:) directly, because feeding record() the must-record variants would record real failures inside the test run itself; the override body is a trivial delegation to the classifier.
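The narrowed swallow condition can be expressed as a pure predicate over the issue description. The sketch below is TypeScript rather than the Swift classifier, and the function name and sample strings are assumptions; it only encodes the rule stated above, that both the "Failed to get matching snapshot" prefix and the kAXError token must be present.

```typescript
// Hypothetical sketch of the classifier rule. The real Swift classifier may
// match more precisely than plain substring checks.
function isSuppressedAxSnapshotIssue(description: string): boolean {
  return (
    description.includes("Failed to get matching snapshot") &&
    description.includes("kAXError")
  );
}

// The five cases the unit test is described as pinning (sample strings are
// illustrative, apart from the two quoted in this PR):
const cases: Array<[string, boolean]> = [
  ["Failed to get matching snapshot: kAXErrorIllegalArgument", true],
  ["Failed to get matching snapshot: kAXErrorCannotComplete", true],
  ["Failed to get matching snapshot: Timed out while evaluating UI query.", false],
  ["Some unrelated assertion failure", false],
  ["kAXErrorIllegalArgument reported outside a snapshot fetch", false],
];
const mismatches = cases.filter(([text, expected]) => isSuppressedAxSnapshotIssue(text) !== expected);
```

Testing the predicate instead of the override keeps the must-record variants from registering as real failures in the test run, which is the reason given above for the same split in the Swift test.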

Gate: format/typecheck/lint/layering green; Swift unit surface compiles (macOS CI-gate command). Full vitest runs on this machine currently show shifting spawn-timeout flakes from external CPU contention (five unrelated ~100%-CPU processes); every flagged file passes in isolation and none intersect this diff — CI remains the authoritative run.

@thymikee thymikee marked this pull request as ready for review July 5, 2026 08:08
@thymikee thymikee merged commit 83d5461 into main Jul 5, 2026
20 checks passed
@thymikee thymikee deleted the fix/ios-capture-stall-recovery branch July 5, 2026 08:08
