Skip to content

fix(substrate): data-plane reliability, concurrency & scale hardening#253

Merged
drewstone merged 8 commits into
mainfrom
harden/substrate-reliability
Jun 19, 2026
Merged

fix(substrate): data-plane reliability, concurrency & scale hardening#253
drewstone merged 8 commits into
mainfrom
harden/substrate-reliability

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

What

Reliability, concurrency, and scale hardening of the trace/eval data plane, driven by a four-front staff-level audit (resilience, correctness/edge-cases, concurrency/scale, test-coverage). Every fix is surgical and additive — no public signature changed, no behavior removed — and lands with a regression test that fails on the old code.

Verification

  • Tests: 2425 pass / 2 skipped / 0 fail (243 files; +124 tests, +16 files vs the pre-hardening baseline of 2301)
  • Typecheck: tsc --noEmit clean
  • Build: tsup clean (DTS + OpenAPI emitted)
  • Modified existing tests add 43 assertions and remove 2 — both removed were assertions that encoded the old searchTrace fabricated-count bug, replaced with correct-behavior assertions. No test deleted/skipped/.only.

Tranches

1 — de-dup keys, OTLP timestamps, budget split

  • argHash always returns a string (undefined/function args no longer break the de-dup keys behind stuck-loop & failure-cluster); undefined-valued keys match JSON semantics
  • stuckLoopView: NUL-keyed buckets carry the tool name, so a tool name containing the delimiter (e.g. shell|grep) is no longer mislabeled or collided
  • OtlpFileTraceStore: NaN-safe, dialect-aware (ISO + epoch-ms) timestamp parsing for duration, time-range, and span ordering
  • AnalystRegistry: budget splits across analysts that actually run, not ones skipped for missing input

2 — fail-loud capture + durable trace store

  • executor: throws CaptureIntegrityError on a malformed chat response instead of scoring an empty turn (a legitimately-empty completion is preserved); judge retries capture/log the real error, gate on isTransientLlmError, and record per-judge failure reasons
  • FileSystemTraceStore: awaits the index mirror (no floating promise), serializes per-file appends behind an async mutex (no read-modify-write rollover race), and applies _update patches in a second load pass (a patch read before its base no longer inserts a fieldless fragment)
  • eval-campaign: Promise.allSettled + finally finalization so one worker's bug can't orphan the others' runs; the abortRun catch is narrowed so real store errors surface

3 — fail-loud HTTP/LLM + bounded fan-out + subprocess safety

  • ProductClient: single request() boundary checks res.ok before res.json(), requireArray() replaces ?? [] masking, per-request timeout
  • callLlm: optional caller AbortSignal (linked via AbortSignal.any, fatal/never-retried), cross-attempt deadlineMs budget, and finishReason/contentEmpty so free-form callers can fail loud
  • judge-runner + prm/inference: unbounded Promise.all fan-outs replaced with the repo's bounded worker pool; the PRM ensemble flattens the grader×runId product and uses allSettled to isolate failures
  • command-runner: distinguishes ENOENT from real I/O/permission errors; resolves bins via command -v/PATH not which
  • sandbox-harness: killedByTimeout is a hard fail regardless of exit code (a timeout can't masquerade as a pass); output buffers capped, listeners detached
  • sandbox-pool: a slot whose reset fails is destroyed/replaced, never recycled dirty

4 — scale guards, O(n²) fix, correctness + coverage backfill

  • OtlpFileTraceStore: maxFileBytes guard (fail loud, no OOM) + event-loop yields during index build; per-call truncation counter (no cross-call WeakMap race); searchTrace reports an exact total when uncapped and leans on has_more when capped; attribute-path scan skips escaped quotes
  • tool-waste: O(tools×llms) scan replaced with a sorted binary-search cutoff; an empty tool result is no longer miscounted as wasted
  • behavioral-metrics: a 0 → N input-token blowup now fires monotonic-input-growth
  • first-divergence: handles the empty-trajectory (minLen 0) edge without a negative index or leaked undefined step
  • coverage: new tests for the 8 previously-untested pipeline views, otlp-span, and detector edge cases

Method

Findings came from four parallel evidence-based auditors; the fixes were implemented across disjoint file-sets and each adversarially reviewed (an independent skeptic tried to refute the fix, check for new silent fallbacks, and confirm the regression tests are real) before the full suite gate.

Residuals (follow-ups, out of scope here)

  • FileSystemTraceStore rolled-file naming uses Date.now() only — two rollovers in the same millisecond could collide on the destination path (not triggered by the mutex fix).
  • OtlpFileTraceStore still reads the whole file into one Buffer; the maxFileBytes guard + yields remove the OOM/event-loop-starvation risk, but a true streaming index is a larger follow-up.

- argHash always returns a string (undefined/function args no longer break
  the de-dup keys behind stuck-loop and failure-cluster)
- stuckLoopView: NUL-keyed buckets carry the tool name, so a tool name
  containing the delimiter is no longer mislabeled or collided
- OtlpFileTraceStore: NaN-safe, dialect-aware (ISO + epoch-ms) timestamp
  parsing for trace duration, time-range, and span ordering
- AnalystRegistry: budget splits across analysts that actually run, not
  ones skipped for missing input

Adds regression tests for all four.
…support

ProductClient: route GET/POST/PATCH through a single request() boundary that
checks res.ok before res.json() and throws status+body on failure; replace the
?? [] collection masking with a requireArray() helper that surfaces a contract
violation instead of a fake empty set; add a per-request AbortController timeout
(ProductClientConfig.timeoutMs, default 30s).

callLlm: accept an optional caller AbortSignal linked to the per-attempt timeout
via AbortSignal.any (caller abort is fatal, never retried); enforce a
cross-attempt deadlineMs wall-clock budget that stops the retry loop early;
expose finishReason and contentEmpty so free-form callers can fail loud on
truncation or empty completions.
- executor: throw CaptureIntegrityError on a malformed chat response instead
  of scoring an empty turn; preserve a legitimately-empty completion. Judge
  retries capture the real error, log it, gate on isTransientLlmError (no
  blind retry of deterministic failures), and record per-judge failure reasons
- FileSystemTraceStore: await the index mirror (no floating promise), serialize
  per-file appends behind an async mutex (no read-modify-write rollover race),
  and apply _update patches in a second load pass so a patch read before its
  base no longer inserts a fieldless fragment
- eval-campaign: Promise.allSettled + finally finalization so one worker's bug
  can't orphan the others' runs; narrow the abortRun catch to the
  already-finalized case so real store errors surface
…e 3)

- judge-runner + prm/inference: replace unbounded Promise.all fan-outs with the
  repo's bounded worker-pool pattern (default concurrency = cpu count); the PRM
  ensemble flattens the grader×runId product and uses allSettled so one grader's
  failure no longer voids the ensemble or storms the rate limit
- command-runner: distinguish ENOENT from real I/O/permission errors (no longer
  collapse to 'absent'/'empty'); resolve bins via command-v/PATH not which
- sandbox-harness: flag killedByTimeout and treat it as a hard fail regardless
  of exit code (a timed-out run can't masquerade as a pass); cap output buffers
  and detach listeners on completion
- sandbox-pool: a slot whose reset fails is destroyed and replaced, never
  recycled dirty to the next waiter
…l (tranche 4)

- OtlpFileTraceStore: maxFileBytes guard (fail loud, no OOM) + event-loop yields
  during index build; per-call truncation counter (no cross-call WeakMap race);
  searchTrace reports an exact total when uncapped and leans on has_more when
  capped (no fabricated count); attribute-path scan skips escaped quotes
- tool-waste: O(tools x llms) scan replaced with a sorted binary-search cutoff;
  an empty tool result is no longer miscounted as wasted
- behavioral-metrics: a 0 -> N input-token blowup now fires monotonic-input-growth
- first-divergence: handle the empty-trajectory (minLen 0) edge without a
  negative index or leaked undefined step
- coverage: new tests for the 8 previously-untested pipeline views, otlp-span,
  and detector edge cases
tangletools
tangletools previously approved these changes Jun 19, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 1c91d258

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T13:20:53Z

Format + import-organize the tranche-1 files that missed the pre-commit
formatter; fixes the CI biome lint failure. No behavior change.
tangletools
tangletools previously approved these changes Jun 19, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 79449925

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T13:30:30Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 4 (2 low, 2 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 307.1s (2 bridge agents)
Total 307.1s

💰 Value — sound-with-nits

Surgical reliability hardening across 43 files — fail-loud HTTP/FS contract enforcement, correct errno discrimination, budget pre-count fix, O(n²) pipeline perf fixes, and durable trace store serialisation — all in the grain of the existing codebase and backed by regression tests.

  • What it does: Hardens the data plane in four tranches: (1) fail-loud HTTP client — checks res.ok before JSON parse, aborts hung requests, and rejects missing array fields instead of masking 500/401 errors as empty results; (2) errno-aware FS — readFile/readDir surface non-ENOENT errors (EACCES, EISDIR, EIO) instead of collapsing them to null/[], and hasBin walks PATH directly instead of shelling out to
  • Goals it achieves: Eliminates silent data corruption paths — a 500 error can no longer masquerade as zero tasks, a permission-denied file read no longer reports 'file not found', a malformed chat response no longer scores as an empty turn, a tool name with a pipe no longer gets mislabeled, and a stale searchTrace cap no longer fabricates match counts. Every fix converts a silent-failure mode into a loud, debuggabl
  • Assessment: Each change is coherent, minimal in blast radius (no public API changes), and follows existing codebase idioms: CaptureIntegrityError extends the existing error hierarchy, isTransientLlmError is an existing function re-used rather than re-created, errno discrimination mirrors the existing isErrnoCode helper, and the promise-chain lock follows the same per-name-keyed serialization pattern as
  • Better / existing approach: Two weak nits, neither blocking: (1) stableStringify is duplicated 5× across the codebase (src/trace/query.ts, src/workflow/sanitize.ts, src/wire/schemas.ts, src/prm/builtin-rubrics.ts, src/campaign/distillation/run-distillation.ts) — this PR fixes only the copy used by argHash (stuck-loop, failure-cluster), leaving 4 other copies with the same undefined/function-arg bug. A central s
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1

🎯 Usefulness — sound

Reliability hardening across 18 data-plane primitives — every fix is additive, surgical, and regression-tested; all changes have clear caller paths; no competing patterns or reinvention.

  • Integration: All 18 behavior changes are reachable. Internal callers: argHash is called by stuckLoopView/toolWasteView/toolUseMetrics (src/pipelines/stuck-loop.ts:34, src/pipelines/tool-waste.ts:4, src/tool-use-metrics.ts:5); OtlpFileTraceStore methods are called by analyzeTraces (src/trace-analyst/analyst.ts:15) and campaign drivers (src/campaign/drivers/trace-analyst.ts:10); AnalystRegistry.run is called by
  • Fit with existing patterns: Every change follows the codebase's established patterns. The request() + requireArray() in ProductClient consolidates three identical inline fetch blocks into a single boundary — matches the substrate's 'fail loud on external boundaries' doctrine from CLAUDE.md. resolveBinOnPath replaces a subprocess shell dependency with a pure PATH walk — consistent with the repo's prior removal of shell deps.
  • Real-world viability: Each fix addresses a confirmed production-failure mode, not just the happy path. Concurrency: FileSystemTraceStore append is now serialized behind an async mutex (prevents read-modify-write rollover race), the index mirror is awaited (prevents silent disk/index divergence), and _update patches apply in a second pass after all base rows load (fixes patch-before-base corruption when readdir order is
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added src/executor.ts

  •      console.log(`    ${judgeName} retry ${attempt}/2 (waiting ${wait / 1000}s)`)
    

🟡 Cruft: magic number added src/executor.ts

  •      console.log(`    ${judgeName} retry ${attempt}/2 (waiting ${wait / 1000}s)`)
    

💰 Value Audit

🟡 stableStringify duplicated 5× — fix applied to only 1 copy [duplication] ``

The undefined and function/symbol guards added to stableStringify in src/trace/query.ts:57-73 fix a real bug (non-string return from argHash breaking de-dup keys). However, four other copies of stableStringify exist at src/workflow/sanitize.ts:160, src/wire/schemas.ts:401, src/prm/builtin-rubrics.ts:125, and src/campaign/distillation/run-distillation.ts:269 — all with the same JSON.stringify(undefined)undefined gap. This is pre-existing codebase debt; the PR is correctly

🟡 Promise-chain lock reimplements serialization already provided by Mutex [duplication] ``

FileSystemTraceStore.append() at src/trace/store.ts:209-227 implements a custom promise-chain lock (14 lines with error isolation and tail cleanup). The codebase already has Mutex.runExclusive() (src/concurrency.ts:13) used by LockedJsonlAppender (src/locked-jsonl-appender.ts:28) for the same per-file append serialization pattern. Reusing Mutex would be ~4 lines (mutex.runExclusive(() => this.appendLocked(...))) and avoids the subtle error-swalling tail promise and tail-pointer c


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260619T133732Z

These were surfaced by an adversarial review of the combined diff and proven
(via git) to pre-date this branch; folded in here since this is the hardening PR.

- FileSystemTraceStore.load: replay files in write order (rolled <ts> ascending,
  bare active last) instead of readdir order, so an older patch can't overwrite a
  newer one and silently restore a stale field
- SandboxPool: count in-flight mints against capacity in the same critical section
  as the gate, so concurrent first checkouts can't both mint past size; release the
  reservation (and wake a waiter) if factory.create fails
- llm-client: correct the finishReason/contentEmpty docstrings — they are signals a
  caller MAY inspect; callLlm surfaces but does not itself reject on them

Deferred (documented): FileSystemTraceStore append-during-load lost-index-update —
a correct fix is a load/append coordination change that warrants its own PR.
tangletools
tangletools previously approved these changes Jun 19, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — c9b8cc4d

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T13:48:49Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Value Audit — sound

Verdict sound
Concerns 4 (1 medium-concern, 2 low, 1 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 249.0s (2 bridge agents)
Total 249.0s

💰 Value — sound

Hardens the trace/eval data plane across 43 files with fail-loud boundaries, concurrency safety, and NaN-resistant parsing — all surgical, additive, and test-gated; the one design-in-progress concern (custom promise-chain lock in FileSystemTraceStore where Mutex would suffice) is stylistic, not corr

  • What it does: Adds six capability layers: (1) ProductClient gains a centralized request() method with timeout support, res.ok gating, and requireArray contract enforcement so 4xx/5xx/malformed responses throw instead of silently returning empty arrays; (2) command-runner replaces the which-based hasBin with a cross-platform resolveBinOnPath that probes PATH + PATHEXT via accessSync, and discri
  • Goals it achieves: Eliminates silent data corruption paths: (a) HTTP errors no longer masquerade as legitimate empty result sets, (b) IO failures other than 'file not found' are surfaced instead of silently collapsed to null/empty, (c) malformed LLM chat responses are rejected before scoring, (d) concurrent NDJSON appends cannot collide or lose rows, (e) trace replay order is deterministic and patch rows always find
  • Assessment: This is a coherent, well-executed hardening pass. Every change is additive — no public signature was broken, no existing behavior was removed. Each fix comes with a regression test that fails on the old code. The changes follow the codebase's existing patterns: Mutex for the sandbox pool, AgentEvalError hierarchy for errors, isTransientLlmError shared between executor and llm-client. The com
  • Better / existing approach: One design-in-progress concern, not a ship-blocker: FileSystemTraceStore in src/trace/store.ts:197-226 implements a custom per-file promise-chain lock (Map<string, Promise<void>> with non-rejecting tail promises) to serialize appends. The codebase already has Mutex (src/concurrency.ts) used for the same purpose by LockedJsonlAppender (src/locked-jsonl-appender.ts) and `reference-repl
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1

🎯 Usefulness — sound

Comprehensive reliability/concurrency/scale hardening — every change fixes a specific defect with a regression test, zero public signature breaks, all reachable through existing callers or exported additive surface.

  • Integration: Every changed module is reachable: ProductClient → MetricTracker.getState (src/metrics.ts:193-198) and AgentDriver (src/driver.ts:171); executeScenario → BenchmarkRunner.run (src/benchmark.ts:40); AnalystRegistry → contract/analyze-runs.ts and campaign/drivers/trace-analyst.ts; FileSystemTraceStore → eval-campaign.ts and OTLP exporters; stuckLoopView/toolWasteView → exported pipelines consumed by
  • Fit with existing patterns: All changes follow the codebase's established patterns: the requireArray fail-loud matches the existing CaptureIntegrityError/runE2EWorkflow error-surfacing doctrine; the bounded-concurrency worker pool (judge-runner, PRM) is the identical pattern across both modules; sleep injection mirrors llm-client.ts:301; the two-pass load in FileSystemTraceStore fits the append-only NDJSON architec
  • Real-world viability: Each fix handles realistic edge conditions: ProductClient timeout via AbortController prevents hung harnesses; requireArray prevents a 500 error body from being parsed as []; CommandRunner distinguishes EACCES/EISDIR from ENOENT (a permission error masked as 'file absent' would silently skip critical artifacts); sandbox handle kills the whole process group so a runaway grandchild can't keep pi
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added src/executor.ts

  •      console.log(`    ${judgeName} retry ${attempt}/2 (waiting ${wait / 1000}s)`)
    

🟡 Cruft: magic number added src/executor.ts

  •      console.log(`    ${judgeName} retry ${attempt}/2 (waiting ${wait / 1000}s)`)
    

💰 Value Audit

🟠 FileSystemTraceStore implements custom promise-chain lock where Mutex already exists [duplication] ``

src/trace/store.ts:197-226 uses a custom Map<string, Promise<void>> with a non-rejecting tail-promise pattern to serialize per-file NDJSON appends. The codebase already has Mutex (src/concurrency.ts, exported at src/index.ts:1088) which provides the same serialization guarantee via runExclusive. LockedJsonlAppender (src/locked-jsonl-appender.ts) and reference-replay.ts both use Mutex for the exact same purpose (serializing JSONL appendFile calls). The appendLocked method's

🟡 isErrnoCode not exported despite reuse potential [against-grain] ``

src/command-runner.ts:169 defines a clean isErrnoCode(err, code) helper. At least 5 other files (src/trace-analyst/store-otlp.ts:443, src/trace/raw-provider-sink.ts:256, src/meta-eval/sentinel.ts:273, src/experiment-tracker.ts:314, src/eval-trace-store.ts:124) inline the same (err as NodeJS.ErrnoException).code === 'ENOENT' pattern with no abstraction. The PR doesn't export isErrnoCode or refactor those call sites. Not a correctness concern, but the pattern was worth unifying i


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260619T135451Z

Witnessed defects from the integrity audit (each proven by running the real
code, expected-vs-actual); refuted/intended candidates were dropped, not shipped.

- otlp-to-run-records: span ordering, earliest/latest bounds, and wallMs used
  raw string comparison + Date.parse, which mis-ordered spans and zeroed wallMs
  for epoch-millis-string timestamps. Now epoch-aware via shared spanEpochMillis
  / compareSpanTime helpers (mirrors the store-otlp fix).
- query.aggregateLlm: summed input/output/cached but omitted reasoningTokens
  (a defined LlmSpan field) — reasoning usage was invisible to cost/perf analysis.
- analyst.test: replaced a vacuous expect(true).toBe(true) with a real
  settles-not-hangs assertion.

Adds witness-backed regression tests for the timestamp + reasoning-token fixes.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 09ff6713

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T14:13:29Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Value Audit — sound

Verdict sound
Concerns 3 (2 low, 1 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 387.0s (2 bridge agents)
Total 387.0s

💰 Value — sound

Comprehensive substrate hardening across 47 files: fail-loud HTTP/FS/LLM boundaries, bounded concurrency pools preventing resource exhaustion, budget-split fairness, and de-dup/serialization correctness — all surgical, additive, and regression-tested. Ship.

  • What it does: Hardens the data-plane reliability, concurrency, and correctness of the agent-eval substrate across seven tranches: (1) fail-loud HTTP client — centralized request() with res.ok check, requireArray() replacing silent ?? [] masking, and AbortController timeout; (2) fail-loud FS operations — command-runner readFile/readDir narrow ENOENT-only catch, isErrnoCode helper, and in-process hasBin via PATH
  • Goals it achieves: Makes the agent-eval substrate reliable at scale: HTTP/LLM/FS boundaries fail loud instead of silently masking errors as empty results; concurrent fan-out is bounded so wide eval matrices don't exhaust PIDs/file-descriptors or trigger rate-limit storms; orphaned runs from sibling failures are properly finalized instead of leaking resources; the analyst budget is fairly split so skipped inputs don'
  • Assessment: This is an excellent change. Every fix is surgical — no public signatures changed, no behavior removed, no existing consumers broken. The pattern is consistent throughout: identify a silent-corruption or resource-leak class, add a loud guard with a typed error, and land a regression test that proves the old code failed. The approach follows the codebase's documented 'fail loud' doctrine (CLAUDE.md
  • Better / existing approach: none — this is the right approach. Two minor observations that don't rise to concerns worth flagging: (1) the cursor-pool pattern is duplicated verbatim across eval-campaign.ts:388, judge-runner.ts:83, matrix/runner.ts:138, prm/inference.ts:41 and :82 — a future low-priority cleanup could extract a boundedPool(concurrency, items, worker) into src/concurrency.ts alongside the existing Mutex, but
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1

🎯 Usefulness — sound

Thorough, well-structured hardening pass across the substrate data plane — every change is additive, correctly integrated into the existing architecture, backed by regression tests, and consumes/wires into the established codebase grain without competing patterns.

  • Integration: All changed surfaces are reachable. executeScenario is called by Benchmark (src/benchmark.ts:40); AnalystRegistry is the central analyst orchestration primitive used by campaign/analyst-surface.ts, examples/, and the contract layer; FileSystemTraceStore is the default trace store for campaigns (src/campaign/run-campaign.ts:72) and is consumed by 30+ modules (prm, rl, builder-eval, meta-eval, pipel
  • Fit with existing patterns: All changes follow the established substrate pattern: primitives exported via index barrel, dependency on shared abstractions (TraceStore interface, Analyst types, TCloud), no competing implementations. The ProductClient.request() centralizes what was previously N ad-hoc fetch calls each lacking timeout/ok-checks — consistent with the fail-loud doctrine in CLAUDE.md. The FileSystemTraceStore appen
  • Real-world viability: Changes address real failure modes: argHash string-return prevents stuck-loop de-dup from breaking on undefined/null/function args (verified by query.test.ts:22-45); FileSystemTraceStore append mutex prevents read-modify-write races under concurrent writes (verified by store.test.ts); two-pass load handles patches arriving before base rows across rolled files (store.test.ts); CaptureIntegrityError
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added src/executor.ts

  •      console.log(`    ${judgeName} retry ${attempt}/2 (waiting ${wait / 1000}s)`)
    

🟡 Cruft: magic number added src/executor.ts

  •      console.log(`    ${judgeName} retry ${attempt}/2 (waiting ${wait / 1000}s)`)
    

🎯 Usefulness Audit

🟡 login() and chat() bypass the new request() timeout boundary [integration] ``

ProductClient.request() is documented as 'Single HTTP boundary for every JSON request' (src/client.ts:29), but login() (line 87) and chat() (line 121) still use raw fetch() without AbortController timeout. login() additionally lacks the centralized res.ok→throw pattern (has a manual body.user check instead). Both methods have legitimate reasons for divergence — login() needs redirect: 'manual' + set-cookie, chat() needs response body streaming — so this is not a defect, but the doc comment and a


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260619T142151Z

@drewstone drewstone merged commit 4477db5 into main Jun 19, 2026
1 check passed
drewstone added a commit that referenced this pull request Jun 19, 2026
…hardening (#258)

Version-only bump of the trio (package.json + python pyproject + __init__).
Ships everything merged since 0.93.0:
- #253 data-plane reliability/concurrency/scale hardening
- #254 paired-binary + coding-eval estimators (McNemar, risk-diff, Wilson, pass@k)
- #255 trace-store append/load race + collision-free rollover
- #256 McNemar power + required-N
- #257 single source of truth for span-timestamp parsing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants