diff --git a/AGENTS.md b/AGENTS.md index 62066e7a8..c9b735ab1 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -29,6 +29,17 @@ Single-context repo. Read `CONTEXT.md` for domain language and testing/architect - Define verifiable success criteria before editing. - Decide docs/skills impact up front. +## Principles (expensive lessons — each cost an incident) +- Guarantees erode at path boundaries. Any new dispatch path or fast path classifies its cells in `src/contracts/interaction-guarantees.ts` first; tsc forces completeness, you supply honesty. ADR 0011. +- A registry claim is not a semantic check: never mark a cell `runner` without reading whether the Swift code implements the guarantee's *definition*, not just a similar-sounding behavior. +- Delegation-on-error is not success-path parity. A fast path that falls back on failure can still succeed on a candidate the shared rules would refuse. +- Do not measure before confirming the code path can fire. An A/B whose B-arm cannot execute returns two green runs masquerading as evidence. +- Typed signals over message sniffing: key on structured details (`details.timeoutMs`, reason codes), never on error text. Remaining sniffs are owned debt with in-code rationale — do not copy the pattern. +- Snapshot output is the token budget. Never add per-node bytes to the tree; response-level metadata rides once per response. +- Warnings compose, never clobber. Append through the shared response builder; two clobber bugs shipped before this rule. +- Unreleased API surface dies free. Before treating a field as wire-compat, check `git tag --contains `; if it never shipped, delete it now. +- Push only behind `&&`-chained gates. `format:check && typecheck && lint && vitest && git push` — a push that can run after a failed gate eventually will. + ## Scope & Changes - Keep changes scoped to one command family or module group unless the task explicitly crosses boundaries. If scope expands, stop and confirm. - Preserve daemon session semantics and platform behavior. @@ -38,44 +49,27 @@ Single-context repo. Read `CONTEXT.md` for domain language and testing/architect - Test through public interfaces when possible. Do not add unrelated exports just to make tests easier. - Prefer type-level checks when TypeScript can enforce a contract or invalid shape. - Use `unknown` only at trust boundaries: parsed JSON, daemon/runtime payloads, catch values, generic I/O, or parser callbacks. Once a value is validated or its producer has a known contract, narrow to a domain type or focused parser/helper instead of carrying `unknown` through internal helper and formatter signatures. -- Keep modules small for agent context safety: - - target <= 300 LOC per implementation file when practical. - - if a file grows past 500 LOC, plan/extract focused submodules before adding new behavior. - - if a file grows past 1,000 LOC, treat it as architecture debt unless it is generated data, a fixture snapshot, or an integration test aggregation. - - long guidance/data tables should live behind focused modules instead of sharing a file with parser/runtime logic. +- Keep modules small for agent context safety. The unit is not lines, it is questions: a file should answer one question, so `rg` -> read-whole-file stays one cheap bounded read. + - numeric tripwires: target <= 300 LOC per implementation file; past 500, extract before adding behavior; past 1,000 is architecture debt unless it is generated data or a fixture snapshot. There is no exemption for tests (see below). + - name files by the domain concept they answer (`runner-cache.ts`, `interaction-touch-response.ts`), not by layer leftovers (`utils2.ts`, `common.ts` accretion). + - colocate machine-readable claims with the code they describe: coverage manifests beside contract tests, registry cells beside enforcement pointers, decision comments at the decision site — agents navigate by claims, not by directory listings. + - test files mirror source topology 1:1: when a source module splits, split its test file the same way in the same PR. A 3,000-line family test aggregation makes every fixture lookup a whole-file read; the worst offenders (`interaction.test.ts`, platform `index.test.ts`) predate this rule and shrink opportunistically — do not add to them. + - shared fixtures live as named exports in a sibling fixtures module (see `test/integration/interaction-contract/fixtures.ts`), never as inline literals repeated per test. + - long guidance/data tables live behind focused modules instead of sharing a file with parser/runtime logic. + - barrels only at package boundaries; internal barrels add a navigation hop per read. Legacy internal barrels are gated for removal (CONTEXT.md). - prefer deep modules over mechanical splits: extract when it improves locality for a concept callers already need, not just to reduce line count. - Before finalizing a code change, do one tightening pass over touched and directly adjacent areas: drop obsolete code, redundant tests, stale helpers/fixtures, and needless duplication made unnecessary by the change. - Prefer existing helpers. Add a helper only when it reduces real repetition or clarifies domain behavior. - When adding new guidance, examples, schemas, or command metadata, decide whether it belongs in the command surface, CLI grammar, CLI help, MCP projection, or daemon runtime before editing. - Prefer updating existing domain vocabulary in `CONTEXT.md` when naming a new durable module concept. Do not coin parallel names in docs, tests, and code. -## Routing -- Keep `src/daemon.ts` as a thin router. -- Keep command names centralized in `src/command-catalog.ts`; do not re-create command identity sets in handlers or request policy modules. -- Keep daemon routing and request-policy traits centralized in `src/daemon/daemon-command-registry.ts`; request modules should consume its predicates instead of recreating command string sets. See `docs/adr/0003-daemon-command-registry.md`. -- Keep command input/output contracts in the command modules: - - command surface and shared schemas: `src/commands/command-surface.ts`, `src/commands/command-contract.ts`, `src/commands/command-input.ts` - - typed client command execution: `src/commands/client-command-contracts.ts` - - command families: `src/commands/interaction-command-contracts.ts`, `src/commands/batch-command.ts`, with other typed client contracts in `src/commands/client-command-contracts.ts` - - CLI positional/flag grammar: `src/commands/cli-grammar.ts` and `src/commands/cli-grammar/*` - - typed input to daemon request projection: `src/commands/command-projection.ts` - - CLI/client/runtime output projection: `src/commands/cli-output.ts`, `src/commands/client-output.ts`, `src/commands/runtime-output.ts` -- Do not reintroduce CLI-shaped command adapters or schemas as a second source of truth. CLI, Node.js, and MCP should project from command contracts. -- Keep `src/daemon/request-router.ts` as request orchestration: auth, diagnostics scope, request admission, locking, handler chain, and fallback dispatch. -- New daemon handler-family commands must update `src/daemon/daemon-command-registry.ts` with the route and request-policy traits. `src/daemon/__tests__/daemon-command-registry.test.ts` guards route and policy traits; handler catalog tests keep executable handler sanity checks. -- Put request policies in focused request modules: - - tenant/lease/selector/lock admission: `src/daemon/request-admission.ts` - - artifact/error finalization: `src/daemon/request-finalization.ts` - - request-scoped platform provider scoping: `src/daemon/request-platform-providers.ts` - - generic fallback dispatch + action recording: `src/daemon/request-generic-dispatch.ts` - - recording invalidation health: `src/daemon/request-recording-health.ts` -- Put command logic in handler modules: - - session/apps/appstate/open/close/replay/logs: `src/daemon/handlers/session.ts` - - click/fill/get/is: `src/daemon/handlers/interaction.ts` - - snapshot/wait/alert/settings: `src/daemon/handlers/snapshot.ts` - - find: `src/daemon/handlers/find.ts` - - record/trace: `src/daemon/handlers/record-trace.ts` -- Commands routed as generic in `src/daemon/daemon-command-registry.ts` fall through to daemon fallback dispatch after specialized handlers return null. +## Routing & command identity (read the registries, not this file) +Command identity, routing, capability, and request-policy traits are *derived* artifacts — inspect their declaration sites instead of prose maps: +- one `CommandDescriptor` per command: `src/core/command-descriptor/registry.ts` (catalog, capabilities, MCP/CLI projection, batch policy, timeout policy — ADR 0008) +- daemon route ownership + request-policy traits: `src/daemon/daemon-command-registry.ts` (parity-tested) +- interaction dispatch paths × guarantees: `src/contracts/interaction-guarantees.ts` (ADR 0011) +- command names: `src/command-catalog.ts`; never re-create command string sets in handlers +Keep `src/daemon.ts` a thin router and `src/daemon/request-router.ts` orchestration-only. New daemon handler-family commands update the daemon command registry; its tests guard the traits. ## Toolchain Snapshot - Package manager: `pnpm` only. Do not add or restore `package-lock.json`. @@ -84,7 +78,8 @@ Single-context repo. Read `CONTEXT.md` for domain language and testing/architect - Lint/format stack is OXC: - config: `.oxlintrc.json`, `.oxfmtrc.json` - TypeScript is strict enough to surface dead code early: `strict`, `isolatedModules`, `noUnusedLocals`, and `noUnusedParameters` are enabled. -- The repo emits with `tsdown` (Rolldown), not `tsc`. If declaration generation fails, inspect `tsconfig.lib.json` first. +- The repo emits with `tsdown` (Rolldown), not `tsc`; `pnpm typecheck` runs `tsgo` (native preview), with `typecheck:tsc` as the fallback. If declaration generation fails, inspect `tsconfig.lib.json` first. +- Dev-loop staleness has three layers; after editing runtime or runner code: `pnpm build` (dist), restart the daemon (it does not self-reload), and remember `shutdown` deliberately HANDS OFF a healthy simulator runner — the adopted runner keeps serving the old Swift binary until you kill its process or the source fingerprint changes. Verifying "my change did nothing" against an adopted runner is a classic false negative. - `tsconfig.lib.json` needs an explicit `rootDir: "./src"` for declaration layout. - Use the aggregate scripts in `package.json` when possible; they encode the expected validation bundles better than ad hoc command lists. @@ -100,13 +95,6 @@ Single-context repo. Read `CONTEXT.md` for domain language and testing/architect - If build/type errors mention declaration generation, inspect `tsconfig.lib.json` before reading platform code. - If lint failures appear after toolchain edits, check whether the rule is from `eslint/*`, `typescript/*`, `import/*`, or `node/*` in `.oxlintrc.json` before assuming source bugs. -## Command Family Lookup -- `logs`: `src/daemon/handlers/session.ts` -> `src/daemon/app-log.ts` -> `src/daemon/handlers/__tests__/session.test.ts` -- `open/close/replay/apps/appstate`: `src/daemon/handlers/session.ts` -> `src/daemon/session-store.ts` -> `src/daemon/handlers/__tests__/session.test.ts` -- `click/fill/get/is`: `src/daemon/handlers/interaction.ts` -> `src/daemon/selectors.ts` -> `src/daemon/handlers/__tests__/interaction.test.ts` -- `snapshot/wait/settings/alert`: `src/daemon/handlers/snapshot.ts` -> `src/daemon/snapshot-processing.ts` -> `src/daemon/handlers/__tests__/snapshot-handler.test.ts` -- `record/trace`: `src/daemon/handlers/record-trace.ts` -> `src/platforms/apple/core/runner/runner-client.ts` -> `src/daemon/handlers/__tests__/record-trace.test.ts` - ## Apple Runner Seams - The OS-agnostic Apple XCTest runner lives under `src/platforms/apple/core/runner/`. Keep dependency direction clean: - `runner-client.ts`: command execution + retry behavior @@ -130,7 +118,20 @@ A new snapshot/command flag touches only the layers that need to understand it. 7. `src/daemon/context.ts` and `src/core/dispatch-context.ts`: add the field only when it flows into platform dispatch. 8. Handler/platform modules: thread the option only after the command surface, grammar, and projection prove it belongs there. -Command-only flags (like `find --first`) that do not flow to the platform layer usually stop at steps 1-3. +9. `scripts/integration-progress-model.ts`: classify the flag (device-observable vs intentionally-outside) — the architecture-progress gate fails CI on unclassified public flags. +10. If the flag changes interaction semantics, revisit the affected cells in `src/contracts/interaction-guarantees.ts` (command scoping via `appliesTo` when the flag exists only on some commands). + +Command-only flags (like `find --first`) that do not flow to the platform layer usually stop at steps 1-3 (plus step 9). + +## Enforcement gates (when one fails, it located your incomplete change) +This repo encodes invariants as self-declaring gates. The correct response to a gate failure is to classify/cover the new thing, never to suppress or allowlist: +- public CLI flags must be classified: `scripts/integration-progress-model.ts` +- interaction guarantee matrix completeness + honesty: `src/contracts/__tests__/interaction-guarantees.test.ts` (gap waivers need `trackingIssue`; the pin list changes only in reviewed diffs) +- every enforced/delegated matrix cell needs a contract scenario: `src/contracts/__tests__/interaction-contract-coverage.test.ts` + `test/integration/interaction-contract/` +- interaction responses build only through `buildInteractionResponseData`: the construction-guard test +- every command declares a timeout policy on its descriptor: the timeout-policy completeness test +- TS/Swift rule parity: golden tables under `contracts/fixtures/` consumed by vitest and the gated XCTest +- cross-command apple-leak guard, folder DAG/import lint, fallow (dead code, duplication, complexity) ## Hard Rules - Use process helpers from `src/utils/exec.ts` for TypeScript process execution: `runCmd`, `runCmdStreaming`, `runCmdSync`, `runCmdBackground`, and `runCmdDetached`. Do not import raw `spawn`/`spawnSync` outside `src/utils/exec.ts`; add or extend an exec helper instead. Plain `.mjs` packaging fixtures that cannot import TypeScript helpers should keep child-process usage local and prefer `execFile`/`execFileSync` over spawn. @@ -175,6 +176,11 @@ Command-only flags (like `find --first`) that do not flow to the platform layer - For Android RN/Expo/dev-client apps connected to any local Metro port, `adb reverse tcp: tcp:` is harmless and should be run before opening the app or URL on the emulator/device. - In sandboxed agent environments, run manual `agent-device` CLI verification that starts the daemon outside the sandbox with escalation. The daemon binds localhost, and sandboxed runs can fail before any product code executes with `listen EPERM: operation not permitted 127.0.0.1` or repeated `Failed to start daemon`/metadata cleanup messages. Do not spend time debugging those as agent-device regressions; rerun the same command with escalation. Unit tests, typecheck, lint, and build can stay sandboxed unless they need platform devices or network/listener access. +## Known environment traps (do not debug these as regressions) +- First `node` exec right after the dev-signed Apple runner launches can block ~19s at 0% CPU (Gatekeeper re-verification). It poisons back-to-back CLI wall-clock timing; absorb with a throwaway `node -e 0` or measure in-process/daemon-side. +- A leftover session holding the device fails every subsequent command instantly with `DEVICE_IN_USE` naming the owner; the hint's `close --session` guidance is the fix, not daemon debugging. +- Contention flakes: `request-handler-catalog` ("specialized daemon routes...") and the doctor provider scenario time out under host load. Protocol before believing a regression: rerun in isolation AND reproduce on plain `origin/main` under the same load. A changing failure set that passes in isolation is contention, not your change. + ## Manual Device Session Hygiene - Treat every manually opened `agent-device` session as a resource that must be closed, including exploratory sessions and failed verification attempts. - For experiments, use a purpose-specific session name and, when practical, an isolated `--state-dir` under `/private/tmp` when you need cleanup isolation beyond the current worktree's default daemon. diff --git a/CONTEXT.md b/CONTEXT.md index 78d1b729b..d214e1dd4 100644 --- a/CONTEXT.md +++ b/CONTEXT.md @@ -42,12 +42,24 @@ - Daemon command registry: daemon-side source of truth for command route ownership and request-policy traits, including admission exemptions, session locking, selector validation, replay-scoped actions, recording invalidation, Android dialog guards, and request provider device resolution. - Runner command traits: per-command-type classification for iOS/macOS runner lifecycle behavior, distinct from the public command surface and daemon command registry. The Swift runner traits classify interaction, read-only, and runner-lifecycle axes for XCTest execution; Swift resolves the alert command as read-only only for its `get` action. The TypeScript runner command traits classify daemon-side runner send/recovery policy such as read-only retry routing, readiness probes, and recent-healthy-mutation preflight skips; the TypeScript table is command-type keyed and currently classifies alert as read-only for daemon retry policy. Each side keeps one source of truth keyed by runner command type. - Coordinate-first resolved element activation: iOS/macOS runner interaction pattern where a selector or text query resolves the semantic `XCUIElement`, then activation uses the element's resolved center coordinate when a frame is available. This keeps target selection semantic while avoiding `XCUIElement.tap()` post-action element re-resolution after normal navigation. tvOS remains focus/remote-driven. +- Interaction dispatch path: one concrete route an interaction command takes to the device (runtime selector/ref resolution, direct iOS selector, native ref via web clickRef, coordinate, maestro non-hittable fallback). Every path classifies every guarantee in the ADR 0011 registry. +- Guarantee cell: one (dispatch path, guarantee) entry in `src/contracts/interaction-guarantees.ts`, classified as runtime/runner/delegated/inapplicable/waived. Completeness is a compile error; honesty is gate-tested. +- Owned waiver: a `gap:`-prefixed waived cell carrying a `trackingIssue` URL. Waivers are diffable debt with an owner, never folklore. +- Parity table: golden JSON fixture under `contracts/fixtures/` consumed by both vitest and the runner's gated Swift tests, so a cross-language rule (e.g. tap-point policy) cannot drift silently. Change the rule only via the table. +- Coverage manifest: `CONTRACT_COVERAGE` export beside each interaction contract test file claiming which matrix cells it proves; the coverage gate requires every enforced/delegated cell to be claimed and rejects overclaims of waived cells. +- Delegation-on-error: a fast path falling back to the runtime path on semantic failure shapes. It closes failure-side guarantee cells only — never success-path parity. - Snapshot capture plan: per-strategy ordered chain of iOS snapshot capture backends (recursive tree, query sweep, private AX) run by one plan runner under a shared wall-clock budget; recovery ordering is declared data, never a per-call-site branch. - Snapshot quality verdict: structured outcome (state, backend, reason code, effective depth, collapsed leaves) computed once by the plan runner and shipped with every planned snapshot payload; the daemon and CLI render it instead of re-deriving degradation from node shapes. - AX-unavailable target invalidation: iOS/macOS runner behavior where a root accessibility snapshot failure such as `kAXErrorIllegalArgument` marks the cached `XCUIApplication` target handle suspect. The runner fails closed for degraded interactive snapshots, clears the cached target, and lets the next command reacquire the app through normal activation. ## Architecture (perfect-shape refactor, completed 2026-07) +ADR 0011 (interaction guarantee contract) is the interaction-semantics counterpart of ADR 0008's +registry thesis: the dispatch-path × guarantee matrix is declared once in +`src/contracts/interaction-guarantees.ts`, completeness is type-enforced, honesty and coverage are +gate-enforced, and cross-language rules are pinned by golden parity tables. New dispatch paths and +guarantees are whole-matrix decisions, not local edits. + The perfect-shape refactor is complete and merged. Its end-state: - Two derivation registries. One `CommandDescriptor` per command diff --git a/docs/adr/0011-interaction-guarantee-contract.md b/docs/adr/0011-interaction-guarantee-contract.md index 54da902f2..cc2993549 100644 --- a/docs/adr/0011-interaction-guarantee-contract.md +++ b/docs/adr/0011-interaction-guarantee-contract.md @@ -2,7 +2,7 @@ ## Status -Proposed +Accepted (implemented through Layer 3, 2026-07-04: #1080, #1082–#1086, #1091, #1092) ## Context diff --git a/docs/adr/README.md b/docs/adr/README.md new file mode 100644 index 000000000..e8bbedc58 --- /dev/null +++ b/docs/adr/README.md @@ -0,0 +1,18 @@ +# ADR index — read this when… + +| ADR | Read when you touch… | +| --- | --- | +| [0001 Provider-First Integration Scenarios](0001-provider-first-integration-scenarios.md) | integration testing strategy, provider transcripts, the scenario harness | +| [0002 Persistent Platform Helper Sessions](0002-persistent-platform-helper-sessions.md) | helper process lifecycle, keep-alive semantics | +| [0003 Daemon Command Registry](0003-daemon-command-registry.md) | daemon routing, request-policy traits | +| [0004 iOS Snapshot Backend Strategy](0004-ios-snapshot-backend-strategy.md) | snapshot capture plans, backend fallbacks, quality verdicts | +| [0005 iOS Runner Interaction Lifecycle](0005-ios-runner-interaction-lifecycle.md) | XCTest runner sessions, leases, adoption, idle-stop | +| [0006 Daemon RPC Protocol Version](0006-daemon-rpc-protocol-version.md) | remote daemon HTTP/JSON-RPC compatibility | +| [0007 Remote Device Leases](0007-remote-device-leases.md) | leases, tenancy, provider-owned devices | +| [0008 Command Descriptor Registry](0008-command-descriptor-registry.md) | adding/changing a command, any surface projection (CLI/MCP/client/batch), timeout policy | +| [0009 Apple Platform Consolidation](0009-apple-platform-consolidation.md) | Apple platform family, apple/appleOs axes, the apple-leak guard | +| [0010 Error system conventions](0010-error-system.md) | error codes, hints, normalizeError, typed error signals | +| [0011 Interaction Guarantee Contract](0011-interaction-guarantee-contract.md) | interaction dispatch paths, fast paths, guards, the guarantee matrix, parity tables | + +ADRs record *why*; the registries and gates they describe are the living source of truth — when +prose and a registry disagree, the registry wins and the ADR needs a follow-up.