feat: --settle returns the settled diff in the interaction response (#1101) by thymikee · Pull Request #1106 · callstack/agent-device

thymikee · 2026-07-04T20:25:08Z

Summary

Implements #1101: press|click|fill|longpress <target> --settle executes the action, waits for the UI to go quiet (the wait stable quiet-window loop, extracted to a shared stable-capture.ts and reused verbatim), and returns the settled observation in the same response — collapsing the dominant interact → observe agent-loop pair into one round trip.

Before / after:

press @e2            → Tapped @e2 (200, 322)
diff snapshot -i     → +@e4 [text] "Welcome!" …        (second call, second inference)

press @e2 --settle   → Tapped @e2 (200, 322)
                       settled after 812ms: +1 -1 (~14 unchanged)
                       + @e4 [text] "Welcome!"          (one call, one inference)

Design (as agreed on the issue):

Payload = settled DIFF vs the pre-action tree. Baseline is the --verify machinery's preActionNodes (ref/selector targets reuse the resolution capture; point targets opt into the evidence-baseline capture). The diff reuses the snapshot-diff machinery (buildSnapshotDiff, flattened like diff -i) and ships changed lines only (added/removed, bounded at 80 with a truncated marker); the unchanged bulk rides as diff.summary.unchanged. A full tree per interaction would invert the snapshot token-budget principle.
Best-effort, never an action failure. Quiet reached → settled: true + diff. Never-settling content (carousel/ticker) → the last capture's diff + settled: false + a hint (the tiny-tree hint style generalized). A broken or stalled settle capture reports itself in settle.hint; the action result is untouched.
Fresh refs ride along (integration with #1076, "Stale @refs silently resolve to the wrong node after the session tree changes", and #1096, "feat: versioned snapshot refs with MCP auto-pinning"). The final settled capture becomes the stored session snapshot; added diff lines carry structured ref bodies minted from it, and settle.refsGeneration rides the payload. A diff-carrying settle response is therefore ref-issuing: it clears snapshotRefsStale at the existing markSessionSnapshotRefsIssued choke point (the same accepted coarse blessing as find's single re-issued ref, documented at the choke point), and the MCP layer merge-only re-pins the added-line refs at the settle generation (SETTLE_REF_ISSUING_TOOLS sits beside REF_ISSUING_TOOLS and is deliberately not in it: a plain press carries no generation and must never clear the pin scope). Diff-less settle payloads (stalled loop, sparse capture not stored) issue nothing and leave staleness untouched.
--settle --verify costs zero extra captures: the settle loop's final capture doubles as the verify evidence source.
Grammar: --settle (boolean opt-in) + --settle-quiet <ms> (quiet window, default 500ms) + the existing --timeout <ms> as the settle deadline (default 10s) — mirroring wait stable [quietMs] [timeoutMs]. Tuning flags without --settle are rejected (INVALID_ARGS). The four interaction descriptors declare a flag-sourced timeout budget with new envelope: 'widen' semantics: like wait's positional budget, --timeout extends the request envelope past the settle budget and never shrinks it (replay/prepare/snapshot keep their verbatim bound semantics).
Fast paths delegate: --settle disables the direct-iOS-selector and native-ref fast paths exactly like --verify (settling needs the tree-based baseline and captures).
Response construction stays single-site: the settle payload rides interactionResultExtra → buildInteractionResponseData (which injects refsGeneration); the construction guard passes unchanged.
Response levels: a conservative interactionSettleView digest keeps the verdict + diff.summary + refsGeneration and drops the line texts; non-settle responses pass through by reference at every level. full returns the default shape ("nothing richer is computed yet", like every existing view) — the issue's "full = whole interactive tree" is deferred because leveled views are pure functions of the default data, and the default payload must not carry the tree.
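To make the best-effort contract concrete, here is a minimal sketch of the decision a quiet-window loop makes after each capture. It is not the code in stable-capture.ts: the type names, `startQuietWindow`, `observeCapture`, and the idea of comparing captures by a string fingerprint are all assumptions for illustration. Only the defaults (500ms quiet window, 10s deadline) and the settled / never-settled outcomes come from the design above.

```typescript
// Hypothetical sketch: names and the fingerprint comparison are assumptions.
type QuietState = {
  startedAtMs: number; // when the settle loop began
  lastChangeAtMs: number; // last time a capture differed from the previous one
  lastFingerprint: string | null;
};

type QuietVerdict = "pending" | "settled" | "deadline";

const DEFAULT_QUIET_MS = 500; // --settle-quiet default per the PR description
const DEFAULT_TIMEOUT_MS = 10_000; // --timeout default per the PR description

function startQuietWindow(nowMs: number): QuietState {
  return { startedAtMs: nowMs, lastChangeAtMs: nowMs, lastFingerprint: null };
}

// Feed one capture into the state and decide whether to keep polling.
function observeCapture(
  state: QuietState,
  fingerprint: string,
  nowMs: number,
  quietMs: number = DEFAULT_QUIET_MS,
  timeoutMs: number = DEFAULT_TIMEOUT_MS,
): { state: QuietState; verdict: QuietVerdict } {
  const changed = fingerprint !== state.lastFingerprint;
  const next: QuietState = {
    startedAtMs: state.startedAtMs,
    lastChangeAtMs: changed ? nowMs : state.lastChangeAtMs,
    lastFingerprint: fingerprint,
  };
  // Quiet reached: the tree has been identical for a full quiet window.
  if (!changed && nowMs - next.lastChangeAtMs >= quietMs) {
    return { state: next, verdict: "settled" };
  }
  // Deadline hit: the caller would report settled: false plus a hint,
  // and still leave the action result untouched.
  if (nowMs - next.startedAtMs >= timeoutMs) {
    return { state: next, verdict: "deadline" };
  }
  return { state: next, verdict: "pending" };
}
```

A driver would call `observeCapture` after every capture, stop on the first non-pending verdict, and diff the last capture against the pre-action baseline in either case. Keeping the decision pure makes it easy to test under a fake clock.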

Guarantee matrix (ADR 0011): --settle adds no dispatch path, but it is a new cross-path response guarantee — added as the settleObservation row and classified for every path: runtime via settleAfterInteraction on runtime-selector / runtime-ref / coordinate, delegated (the flag disables the fast path) on direct-ios-selector / native-ref, inapplicable on the maestro replay path. No appliesTo scoping anywhere: the flag covers every command each path dispatches, and the gate rejects redundant full-coverage lists. One contract scenario per enforced/delegated cell (5 new); the registry-driven coverage gate enforces them. The gap/pin list is unchanged.
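As a reading aid, the settleObservation row can be pictured as a lookup from dispatch path to classification. The sketch below is illustrative only: the type names, the "maestro-replay" key, and the `pathsNeedingScenario` helper are assumptions and not the registry's actual shape. It also treats the "runtime" classification as the enforced case. Only the per-path classifications are taken from the paragraph above.

```typescript
// Hypothetical shape; the real guarantee-matrix registry may differ.
type DispatchPath =
  | "runtime-selector"
  | "runtime-ref"
  | "coordinate"
  | "direct-ios-selector"
  | "native-ref"
  | "maestro-replay";

type Classification = "runtime" | "delegated" | "inapplicable";

// Classifications as stated in the PR description.
const settleObservationRow: Record<DispatchPath, Classification> = {
  "runtime-selector": "runtime",
  "runtime-ref": "runtime",
  coordinate: "runtime",
  "direct-ios-selector": "delegated", // the flag disables the fast path
  "native-ref": "delegated",
  "maestro-replay": "inapplicable",
};

// Every cell that is not inapplicable needs one contract scenario.
function pathsNeedingScenario(
  row: Record<DispatchPath, Classification>,
): DispatchPath[] {
  return (Object.keys(row) as DispatchPath[]).filter(
    (path) => row[path] !== "inapplicable",
  );
}
```

Under these assumptions the helper yields five paths, which lines up with the five new contract scenarios.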

Default responses stay byte-identical without --settle (opt-in flag; the provider suite is the oracle and passes untouched). The snapshot-diff withRefs option is additive and off for the diff command, so its wire shape is unchanged too.

Draft on purpose: the issue's acceptance evidence — the Bluesky dogfood measurement (tokens + wall-clock per step for a fixed multi-step flow, --settle vs the two-call baseline) — comes after this lands on a simulator and is not included here. Live simulator verification, README/website docs, and a SkillGym planning case are deferred to that dogfood pass. Per-call cost when opted in: quiet window (default ~500ms) + captures.

Closes #1101.

Touched files: 48 — interaction command family plus its command-surface/MCP/descriptor/response-level projections; scope intentionally crosses those layers because the flag is a full command-surface addition (checklist steps 1–10).

Validation

Unit: settle-loop composition in src/commands/interaction/runtime/settle.test.ts (settled true/false under fake-clock budgets, never-fails-action, --verify capture sharing with capture counting, longpress path, diff line cap) and daemon response shape in src/daemon/handlers/__tests__/interaction-settle.test.ts (diff payload + refsGeneration, stale-marker clear, stale-input warning kept, diff-less observation leaves staleness untouched, tuning-flag guard, fill @ref wire shape). MCP re-pinning from a settle response (merge-only; plain presses never clear pins) in command-tools.test.ts; settle digest view in response-views.test.ts; widen-envelope derivation in daemon-client.test.ts; descriptor bounded-set updates in timeout-policy.test.ts.
Contract suite: five new settleObservation scenarios (runtime-selector, runtime-ref, coordinate, direct-ios-selector, native-ref) with coverage manifests; the honesty and coverage gates pass.
End-to-end transcript: test/integration/provider-scenarios/settle-observation.test.ts drives open → snapshot -i → press label= --settle against a scripted runner (tap → two stable captures), asserts the settled diff + fresh ref + refsGeneration, then acts on the diff-issued @e2 with no stale warning, and assertComplete() proves the exact runner conversation.
Gates: format:check, typecheck, lint, fallow audit --base origin/main clean (including deleting the pre-existing dead defineCommandDescriptor and extracting the shared runtime-tree matrix cells it flagged as duplication), vitest src/daemon src/commands src/contracts src/mcp src/cli (1330 passed), vitest test/integration, progress-model flag classification.
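For readers unfamiliar with the envelope: 'widen' semantics covered by the daemon-client test, the following is a rough sketch of how such a derivation could behave. Everything here is an assumption made for illustration: the function name, the `baseEnvelopeMs` and `headroomMs` parameters, and the exact arithmetic. The description only states that under 'widen' a --timeout extends the request envelope past the settle budget and never shrinks it, while other commands keep verbatim bound semantics.

```typescript
// Hypothetical sketch of envelope derivation; not the daemon client's code.
type EnvelopeMode = "bound" | "widen";

function deriveRequestEnvelopeMs(
  mode: EnvelopeMode,
  baseEnvelopeMs: number, // assumed default request envelope
  budgetMs: number | undefined, // value of --timeout, if given
  headroomMs: number = 5_000, // assumed slack for the action and captures
): number {
  if (budgetMs === undefined) return baseEnvelopeMs;
  // 'bound': the flag value is the envelope, verbatim.
  if (mode === "bound") return budgetMs;
  // 'widen': grow past the budget when needed, never drop below the base.
  return Math.max(baseEnvelopeMs, budgetMs + headroomMs);
}
```

The point of the asymmetry is that a short --timeout on a settle call should cap the settle wait, not cut off the request carrying the action result.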

Also in this branch: the two doctor provider scenarios that AGENTS.md lists as the known contention flake failed 3/3 full-suite runs on this host at vitest's 5s default (4.9–5.0s of real harness work; they reproduce WITHOUT any of this PR's test additions and pass 3/3 in isolation). They now declare explicit 15s budgets, same in-file precedent as the Metro-probe scenario's 10s budget — separate commit.

Known gaps / follow-ups:

Bluesky dogfood measurement + live simulator evidence (why this is a draft; device-facing behavior is not merge-ready until then).
full-level whole-tree settle view (deferred, see above).
Ref-target baselines reuse the stored session snapshot, so a stored tree that was captured without -i can over-report removals in the settled diff. This is the same baseline caveat that --verify's changedFromBefore already accepts; noted in code.
The shared stable loop now derives its poll cadence from the quiet window (min(300ms, max(25ms, quietMs))); wait stable at the default 500ms quiet window is byte-identical, sub-300ms quiet windows poll faster.
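The cadence formula above is small enough to check by hand. A minimal sketch, assuming a helper named `pollIntervalMs` (the name is hypothetical; the formula is the one quoted above):

```typescript
// Poll cadence derived from the quiet window: min(300ms, max(25ms, quietMs)).
function pollIntervalMs(quietMs: number): number {
  return Math.min(300, Math.max(25, quietMs));
}
```

At the default 500ms quiet window this gives 300ms, a 100ms quiet window polls every 100ms, and anything below 25ms is clamped up to 25ms.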

…1101)

press/click/fill/longpress --settle executes the action, waits for the UI to go quiet (wait stable's loop, shared via stable-capture.ts), and returns the settled diff vs the pre-action tree in the same response: one round trip instead of the interact -> observe pair.

- payload: changed lines only (bounded), summary counts, added-line refs, refsGeneration; best-effort (settled:false + hint on never-quiet content, never an action failure); --verify shares the settle captures
- ref issuance: the settled tree becomes the session snapshot; a diff-carrying settle response clears snapshotRefsStale and the MCP layer merge-only re-pins added-line refs at the settle generation
- grammar: --settle + --settle-quiet <ms> + --timeout <ms> (flag-sourced descriptor budget with new envelope:'widen' semantics mirroring wait)
- ADR 0011: new settleObservation guarantee classified on every path with contract scenarios per enforced/delegated cell

The doctor provider scenarios sit at ~5s of real daemon-harness work on a loaded host and flake at vitest's 5s default during full-suite runs (the known contention flake AGENTS.md documents). Same in-file precedent as the Metro-probe scenario's 10s budget.

github-actions · 2026-07-04T20:25:38Z

Size Report

Metric	Base	Current	Diff
JS raw	1.5 MB	1.5 MB	+8.3 kB
JS gzip	491.3 kB	493.6 kB	+2.3 kB
npm tarball	590.3 kB	593.6 kB	+3.3 kB
npm unpacked	2.1 MB	2.1 MB	+10.1 kB

Startup median (7 runs, lower is better):

Scenario	Base	Current	Diff
CLI --version	21.5 ms	21.7 ms	+0.1 ms
CLI --help	40.8 ms	42.0 ms	+1.2 ms

Top changed chunks:

Chunk	Raw diff	Gzip diff
`dist/src/registry.js`	+125.7 kB	+36.7 kB
`dist/src/device-rotation.js`	+71.4 kB	+23.2 kB
`dist/src/runtime.js`	-3.9 kB	-1.3 kB
`dist/src/cli-help.js`	+485 B	+161 B
`dist/src/selector-runtime.js`	+461 B	+118 B

daemon/handlers/interaction-flags.ts imported the type across the daemon -> commands boundary (R2 commands-floor). The tuning params are part of the interaction contract like SettleObservation, so they live in contracts/interaction.ts and both layers import from there.

…in the cap

Bluesky dogfood: a fill that summons the iOS keyboard spent 49 of the 80 capped diff lines spelling out QWERTY keys, and a screen transition with 269 removals could starve out the added lines entirely. Key-type nodes are now filtered from both diff sides (the [keyboard] container line still signals presence), and under truncation added lines (the ones carrying fresh refs) win slots over removals.

thymikee · 2026-07-04T20:50:43Z

Reviewed blessing-scenario-first, then dogfooded on Bluesky (real sim). Verdict: the core semantics are right, and two content-policy fixes landed from dogfooding.

Review (primary-justification scenarios, all verified in code + tests):

Generation integrity: storeSettledSnapshot routes through createDaemonRuntimeSessionStore.setRecord → setSessionSnapshot, so the generation bumps before settleRefsGenerationIssue reads it. Settle refs pin at G+1, while pre-press pins stay at G and warn precisely. There is no re-blessing: the #1096 ("feat: versioned snapshot refs with MCP auto-pinning") failure mode is structurally absent.
MCP merge path is conditional on settle.refsGeneration + diff presence; a plain press never touches the pin scope (tested). Removed lines never carry ref (enforced in the diff builder, documented on SnapshotDiffLine), so dead baseline refs can't be pinned.
diff ⇔ refs-issued holds at the source: the diff is only attached when the settled tree became the stored snapshot (non-sparse), and the e2e proves the follow-up press @e2 resolves on the stored settled tree with zero extra runner requests.
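The merge-only behavior reviewed above can be sketched as follows. This is an illustration under assumptions, not the MCP layer's code: the types, the pin scope modeled as a map from ref to generation, and `mergeSettlePins` are all hypothetical. What it tries to mirror from the review is that merging is conditional on settle.refsGeneration plus a diff, that only added lines contribute refs, and that existing pins are never cleared.

```typescript
// Hypothetical shapes; field names beyond settle.refsGeneration are assumed.
type DiffLine = { kind: "added" | "removed"; text: string; ref?: string };

type SettlePayload = {
  settled: boolean;
  refsGeneration?: number;
  diff?: { lines: DiffLine[] };
  hint?: string;
};

type InteractionResponse = { settle?: SettlePayload };

// Pin scope modeled as ref -> generation it was issued at.
function mergeSettlePins(
  pins: ReadonlyMap<string, number>,
  response: InteractionResponse,
): Map<string, number> {
  const merged = new Map(pins);
  const settle = response.settle;
  // A plain press, or a diff-less settle payload, issues nothing.
  if (!settle || settle.refsGeneration === undefined || !settle.diff) {
    return merged;
  }
  for (const line of settle.diff.lines) {
    // Removed lines never carry a ref, so dead baseline refs cannot be pinned.
    if (line.kind === "added" && line.ref) {
      merged.set(line.ref, settle.refsGeneration);
    }
  }
  return merged; // existing pins are kept, never cleared
}
```

With a pre-press pin for @e69 at generation 4 and a settle response at generation 5 that adds @e23, the result holds both pins at their own generations.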

Fixes pushed to this branch:

SettleParams moved to contracts/interaction.ts — the daemon import of commands/…/settle.ts violated the layering DAG (the local gate chain didn't include scripts/layering/check.ts; worth adding to the AGENTS.md gate list).
Content-first diff policy, from Bluesky dogfood: a fill that summons the keyboard spent 49/80 capped lines on QWERTY [key] nodes, and a screen transition with 269 removals could starve out added lines entirely. Key-type nodes are now filtered from both diff sides ([keyboard] container survives as the signal) and added lines (the fresh-ref half) win cap slots over removals. Unit-tested.
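The content-first policy can be illustrated with a small sketch. It is written under assumptions: the `Line` shape, the "key" role string, and `capDiffLines` are hypothetical names, and the real diff builder may order or mark lines differently. The cap of 80, the key-node filter, and added lines winning slots over removals are the parts taken from the fix described above.

```typescript
// Hypothetical sketch of the cap policy; not the actual diff builder.
type Line = { kind: "added" | "removed"; role: string; text: string };

const MAX_DIFF_LINES = 80; // cap stated in the PR description

function capDiffLines(
  lines: Line[],
  cap: number = MAX_DIFF_LINES,
): { lines: Line[]; truncated: boolean } {
  // Drop keyboard key nodes from both sides; containers such as
  // [keyboard] have a different role and survive as the signal.
  const content = lines.filter((line) => line.role !== "key");
  if (content.length <= cap) return { lines: content, truncated: false };

  // Under truncation, added lines (the ones carrying fresh refs) get
  // slots first; removals fill whatever room is left.
  const added = content.filter((line) => line.kind === "added").slice(0, cap);
  const removed = content
    .filter((line) => line.kind === "removed")
    .slice(0, cap - added.length);
  const kept = new Set<Line>([...added, ...removed]);

  // Preserve the original line order among the kept lines.
  return { lines: content.filter((line) => kept.has(line)), truncated: true };
}
```

For a transition with 269 removals and a handful of additions, this keeps every addition and trims only removals.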

Dogfood evidence (iPhone 17 Pro sim, Bluesky dev build): the flagship loop worked with zero intermediate snapshots — press @e69 --settle (search tab) returned a diff whose added lines included @e23 [text-field] "Search", and fill @e23 alpenglow --settle acted on it directly. Settle waits observed: 1.0–2.3s, 3–4 captures, payload ~11KB vs ~28KB for a full snapshot -i.

Known trade-off confirmed on-device: network-backed content (search typeahead) can arrive after the quiet window — the diff honestly reflects the settled-but-still-loading screen and the agent needs a follow-up (wait text). This is the designed behavior; the end-to-end benchmark now running will show whether it matters in practice.

Benchmarked with headless haiku/sonnet agents given only --help: both models skipped the help-workflow pointer and started with plain snapshot (38KB payloads they then had to re-read from files). One core-loop line at the starting point is what teaches snapshot -i and --settle to models that never read a second help page.

thymikee · 2026-07-04T23:48:30Z

End-to-end agent benchmark: acceptance evidence

Headless claude -p agents (Bash tool only, max 40 turns) drove the real CLI against Bluesky on an iPhone 17 Pro simulator. Task: search "alpenglow" → report first account handle → Notifications tab → first label → print DONE: line. Hard environment reset between runs. 27 runs total across arms; success = correct DONE line.

Steered prompt: classic act→observe loop vs --settle loop

arm	success	wall (med)	turns (med)	cost (med)
classic × haiku	0/3	174s (all max-turns)	41	$0.31
settle × haiku	3/3	111s	19	$0.25
classic × sonnet	3/3	269s	35	$1.23
settle × sonnet	3/3	358s	36	$1.50

--settle takes Haiku from 0% to 100%. Classic-Haiku drowns in observation management (25 snapshots + 18 file re-reads in 40 turns; 1.4–2.1M cache-read tokens). With --settle it finishes in ~18–19 turns at well under half the median wall time of either Sonnet arm (111s vs 269s and 358s). For Sonnet, --settle is reliability-neutral (3/3 both) and mildly wall/cost-negative in this sample: Sonnet already runs an efficient diff-snapshot loop, and settle's per-action quiet-window waits (~1–3s each) don't buy it turns. This matches the opt-in design: it's a small-model/reliability feature, not a universal accelerator.

Instruction minimalism: does the CLI teach itself?

Bare prompt (no tips), or bare prompt + "run --help first":

Sonnet: 5/5 across minimal + help-discovery. Best run overall: 12 turns / 137s / $0.54 — it read --help, discovered --settle on its own, and used it on every mutating action. The help-workflow line ships the whole loop to strong models with zero steering.
Haiku, original help: 1/4 — falls into plain-snapshot 38KB-payload spirals.
Haiku, after adding one core-loop line to the top-level help (open → snapshot -i → press/fill --settle → repeat, commit e619c1c): 2/3, both successes at 21–24 turns following the taught loop exactly (snapshot -i first, --settle everywhere, one-retry recovery from the @-prefix hint). The single failure shows five 90-second inter-turn gaps — API rate limiting, not agent-device.

Environment reliability

The profile-screen capture wedge (#1105, deterministic repro posted there) was routed around in this benchmark and is being fixed on fix/ios-capture-stall-recovery. Two further findings filed: iOS fill types at ~700ms/char (task chip), and small models blind themselves with 2>/dev/null | jq '.data.x' projections of error JSON — the existing stdout-JSON error contract is what saves them when they drop the projection.

Verdict: the PR does what it set out to do. The settled-diff loop is the difference between a small model failing 100% and succeeding 100% on a real multi-screen task, and the loop is discoverable from --help alone by both model classes once the core-loop line states it. Recommending un-drafting after the #1105 fix lands.

thymikee added 2 commits July 4, 2026 22:14

thymikee added 2 commits July 4, 2026 22:34

thymikee wants to merge 5 commits into main from feat/interaction-settle
