fix(substrate): data-plane reliability, concurrency & scale hardening by drewstone · Pull Request #253 · tangle-network/agent-eval

drewstone · 2026-06-19T13:20:45Z

What

Reliability, concurrency, and scale hardening of the trace/eval data plane, driven by a four-front staff-level audit (resilience, correctness/edge-cases, concurrency/scale, test-coverage). Every fix is surgical and additive — no public signature changed, no behavior removed — and lands with a regression test that fails on the old code.

Verification

Tests: 2425 pass / 2 skipped / 0 fail (243 files; +124 tests, +16 files vs the pre-hardening baseline of 2301)
Typecheck: tsc --noEmit clean
Build: tsup clean (DTS + OpenAPI emitted)
Modified existing tests add 43 assertions and remove 2 — both removed were assertions that encoded the old searchTrace fabricated-count bug, replaced with correct-behavior assertions. No test deleted/skipped/.only.

Tranches

1 — de-dup keys, OTLP timestamps, budget split

argHash always returns a string (undefined/function args no longer break the de-dup keys behind stuck-loop & failure-cluster); undefined-valued keys match JSON semantics
stuckLoopView: NUL-keyed buckets carry the tool name, so a tool name containing the delimiter (e.g. shell|grep) is no longer mislabeled or collided
OtlpFileTraceStore: NaN-safe, dialect-aware (ISO + epoch-ms) timestamp parsing for duration, time-range, and span ordering
AnalystRegistry: budget splits across analysts that actually run, not ones skipped for missing input

2 — fail-loud capture + durable trace store

executor: throws CaptureIntegrityError on a malformed chat response instead of scoring an empty turn (a legitimately-empty completion is preserved); judge retries capture/log the real error, gate on isTransientLlmError, and record per-judge failure reasons
FileSystemTraceStore: awaits the index mirror (no floating promise), serializes per-file appends behind an async mutex (no read-modify-write rollover race), and applies _update patches in a second load pass (a patch read before its base no longer inserts a fieldless fragment)
eval-campaign: Promise.allSettled + finally finalization so one worker's bug can't orphan the others' runs; the abortRun catch is narrowed so real store errors surface

3 — fail-loud HTTP/LLM + bounded fan-out + subprocess safety

ProductClient: single request() boundary checks res.ok before res.json(), requireArray() replaces ?? [] masking, per-request timeout
callLlm: optional caller AbortSignal (linked via AbortSignal.any, fatal/never-retried), cross-attempt deadlineMs budget, and finishReason/contentEmpty so free-form callers can fail loud
judge-runner + prm/inference: unbounded Promise.all fan-outs replaced with the repo's bounded worker pool; the PRM ensemble flattens the grader×runId product and uses allSettled to isolate failures
command-runner: distinguishes ENOENT from real I/O/permission errors; resolves bins via command -v/PATH not which
sandbox-harness: killedByTimeout is a hard fail regardless of exit code (a timeout can't masquerade as a pass); output buffers capped, listeners detached
sandbox-pool: a slot whose reset fails is destroyed/replaced, never recycled dirty

4 — scale guards, O(n²) fix, correctness + coverage backfill

OtlpFileTraceStore: maxFileBytes guard (fail loud, no OOM) + event-loop yields during index build; per-call truncation counter (no cross-call WeakMap race); searchTrace reports an exact total when uncapped and leans on has_more when capped; attribute-path scan skips escaped quotes
tool-waste: O(tools×llms) scan replaced with a sorted binary-search cutoff; an empty tool result is no longer miscounted as wasted
behavioral-metrics: a 0 → N input-token blowup now fires monotonic-input-growth
first-divergence: handles the empty-trajectory (minLen 0) edge without a negative index or leaked undefined step
coverage: new tests for the 8 previously-untested pipeline views, otlp-span, and detector edge cases

Method

Findings came from four parallel evidence-based auditors; the fixes were implemented across disjoint file-sets and each adversarially reviewed (an independent skeptic tried to refute the fix, check for new silent fallbacks, and confirm the regression tests are real) before the full suite gate.

Residuals (follow-ups, out of scope here)

FileSystemTraceStore rolled-file naming uses Date.now() only — two rollovers in the same millisecond could collide on the destination path (not triggered by the mutex fix).
OtlpFileTraceStore still reads the whole file into one Buffer; the maxFileBytes guard + yields remove the OOM/event-loop-starvation risk, but a true streaming index is a larger follow-up.

- argHash always returns a string (undefined/function args no longer break the de-dup keys behind stuck-loop and failure-cluster) - stuckLoopView: NUL-keyed buckets carry the tool name, so a tool name containing the delimiter is no longer mislabeled or collided - OtlpFileTraceStore: NaN-safe, dialect-aware (ISO + epoch-ms) timestamp parsing for trace duration, time-range, and span ordering - AnalystRegistry: budget splits across analysts that actually run, not ones skipped for missing input Adds regression tests for all four.

…support ProductClient: route GET/POST/PATCH through a single request() boundary that checks res.ok before res.json() and throws status+body on failure; replace the ?? [] collection masking with a requireArray() helper that surfaces a contract violation instead of a fake empty set; add a per-request AbortController timeout (ProductClientConfig.timeoutMs, default 30s). callLlm: accept an optional caller AbortSignal linked to the per-attempt timeout via AbortSignal.any (caller abort is fatal, never retried); enforce a cross-attempt deadlineMs wall-clock budget that stops the retry loop early; expose finishReason and contentEmpty so free-form callers can fail loud on truncation or empty completions.

- executor: throw CaptureIntegrityError on a malformed chat response instead of scoring an empty turn; preserve a legitimately-empty completion. Judge retries capture the real error, log it, gate on isTransientLlmError (no blind retry of deterministic failures), and record per-judge failure reasons - FileSystemTraceStore: await the index mirror (no floating promise), serialize per-file appends behind an async mutex (no read-modify-write rollover race), and apply _update patches in a second load pass so a patch read before its base no longer inserts a fieldless fragment - eval-campaign: Promise.allSettled + finally finalization so one worker's bug can't orphan the others' runs; narrow the abortRun catch to the already-finalized case so real store errors surface

…e 3) - judge-runner + prm/inference: replace unbounded Promise.all fan-outs with the repo's bounded worker-pool pattern (default concurrency = cpu count); the PRM ensemble flattens the grader×runId product and uses allSettled so one grader's failure no longer voids the ensemble or storms the rate limit - command-runner: distinguish ENOENT from real I/O/permission errors (no longer collapse to 'absent'/'empty'); resolve bins via command-v/PATH not which - sandbox-harness: flag killedByTimeout and treat it as a hard fail regardless of exit code (a timed-out run can't masquerade as a pass); cap output buffers and detach listeners on completion - sandbox-pool: a slot whose reset fails is destroyed and replaced, never recycled dirty to the next waiter

…l (tranche 4) - OtlpFileTraceStore: maxFileBytes guard (fail loud, no OOM) + event-loop yields during index build; per-call truncation counter (no cross-call WeakMap race); searchTrace reports an exact total when uncapped and leans on has_more when capped (no fabricated count); attribute-path scan skips escaped quotes - tool-waste: O(tools x llms) scan replaced with a sorted binary-search cutoff; an empty tool result is no longer miscounted as wasted - behavioral-metrics: a 0 -> N input-token blowup now fires monotonic-input-growth - first-divergence: handle the empty-trajectory (minLen 0) edge without a negative index or leaked undefined step - coverage: new tests for the 8 previously-untested pipeline views, otlp-span, and detector edge cases

tangletools

✅ Auto-approved PR — `1c91d258`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T13:20:53Z}

Format + import-organize the tranche-1 files that missed the pre-commit formatter; fixes the CI biome lint failure. No behavior change.

tangletools

✅ Auto-approved PR — `79449925`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T13:30:30Z}

tangletools

🟡 Value Audit — sound-with-nits


Verdict	sound-with-nits
Concerns	4 (2 low, 2 weak-concern)
Heuristic	0.0s
Duplication	0.0s
Interrogation	307.1s (2 bridge agents)
Total	307.1s

💰 Value — sound-with-nits

Surgical reliability hardening across 43 files — fail-loud HTTP/FS contract enforcement, correct errno discrimination, budget pre-count fix, O(n²) pipeline perf fixes, and durable trace store serialisation — all in the grain of the existing codebase and backed by regression tests.

What it does: Hardens the data plane in four tranches: (1) fail-loud HTTP client — checks res.ok before JSON parse, aborts hung requests, and rejects missing array fields instead of masking 500/401 errors as empty results; (2) errno-aware FS — readFile/readDir surface non-ENOENT errors (EACCES, EISDIR, EIO) instead of collapsing them to null/[], and hasBin walks PATH directly instead of shelling out to
Goals it achieves: Eliminates silent data corruption paths — a 500 error can no longer masquerade as zero tasks, a permission-denied file read no longer reports 'file not found', a malformed chat response no longer scores as an empty turn, a tool name with a pipe no longer gets mislabeled, and a stale searchTrace cap no longer fabricates match counts. Every fix converts a silent-failure mode into a loud, debuggabl
Assessment: Each change is coherent, minimal in blast radius (no public API changes), and follows existing codebase idioms: CaptureIntegrityError extends the existing error hierarchy, isTransientLlmError is an existing function re-used rather than re-created, errno discrimination mirrors the existing isErrnoCode helper, and the promise-chain lock follows the same per-name-keyed serialization pattern as
Better / existing approach: Two weak nits, neither blocking: (1) stableStringify is duplicated 5× across the codebase (src/trace/query.ts, src/workflow/sanitize.ts, src/wire/schemas.ts, src/prm/builtin-rubrics.ts, src/campaign/distillation/run-distillation.ts) — this PR fixes only the copy used by argHash (stuck-loop, failure-cluster), leaving 4 other copies with the same undefined/function-arg bug. A central s
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🎯 Usefulness — sound

Reliability hardening across 18 data-plane primitives — every fix is additive, surgical, and regression-tested; all changes have clear caller paths; no competing patterns or reinvention.

Integration: All 18 behavior changes are reachable. Internal callers: argHash is called by stuckLoopView/toolWasteView/toolUseMetrics (src/pipelines/stuck-loop.ts:34, src/pipelines/tool-waste.ts:4, src/tool-use-metrics.ts:5); OtlpFileTraceStore methods are called by analyzeTraces (src/trace-analyst/analyst.ts:15) and campaign drivers (src/campaign/drivers/trace-analyst.ts:10); AnalystRegistry.run is called by
Fit with existing patterns: Every change follows the codebase's established patterns. The request() + requireArray() in ProductClient consolidates three identical inline fetch blocks into a single boundary — matches the substrate's 'fail loud on external boundaries' doctrine from CLAUDE.md. resolveBinOnPath replaces a subprocess shell dependency with a pure PATH walk — consistent with the repo's prior removal of shell deps.
Real-world viability: Each fix addresses a confirmed production-failure mode, not just the happy path. Concurrency: FileSystemTraceStore append is now serialized behind an async mutex (prevents read-modify-write rollover race), the index mirror is awaited (prevents silent disk/index divergence), and _update patches apply in a second pass after all base rows load (fixes patch-before-base corruption when readdir order is
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added src/executor.ts

     console.log(`    ${judgeName} retry ${attempt}/2 (waiting ${wait / 1000}s)`)

🟡 Cruft: magic number added src/executor.ts

     console.log(`    ${judgeName} retry ${attempt}/2 (waiting ${wait / 1000}s)`)

💰 Value Audit

🟡 stableStringify duplicated 5× — fix applied to only 1 copy [duplication] ``

The undefined and function/symbol guards added to stableStringify in src/trace/query.ts:57-73 fix a real bug (non-string return from argHash breaking de-dup keys). However, four other copies of stableStringify exist at src/workflow/sanitize.ts:160, src/wire/schemas.ts:401, src/prm/builtin-rubrics.ts:125, and src/campaign/distillation/run-distillation.ts:269 — all with the same JSON.stringify(undefined) → undefined gap. This is pre-existing codebase debt; the PR is correctly

🟡 Promise-chain lock reimplements serialization already provided by Mutex [duplication] ``

FileSystemTraceStore.append() at src/trace/store.ts:209-227 implements a custom promise-chain lock (14 lines with error isolation and tail cleanup). The codebase already has Mutex.runExclusive() (src/concurrency.ts:13) used by LockedJsonlAppender (src/locked-jsonl-appender.ts:28) for the same per-file append serialization pattern. Reusing Mutex would be ~4 lines (mutex.runExclusive(() => this.appendLocked(...))) and avoids the subtle error-swalling tail promise and tail-pointer c

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260619T133732Z}

These were surfaced by an adversarial review of the combined diff and proven (via git) to pre-date this branch; folded in here since this is the hardening PR. - FileSystemTraceStore.load: replay files in write order (rolled <ts> ascending, bare active last) instead of readdir order, so an older patch can't overwrite a newer one and silently restore a stale field - SandboxPool: count in-flight mints against capacity in the same critical section as the gate, so concurrent first checkouts can't both mint past size; release the reservation (and wake a waiter) if factory.create fails - llm-client: correct the finishReason/contentEmpty docstrings — they are signals a caller MAY inspect; callLlm surfaces but does not itself reject on them Deferred (documented): FileSystemTraceStore append-during-load lost-index-update — a correct fix is a load/append coordination change that warrants its own PR.

tangletools

✅ Auto-approved PR — `c9b8cc4d`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T13:48:49Z}

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	4 (1 medium-concern, 2 low, 1 weak-concern)
Heuristic	0.0s
Duplication	0.0s
Interrogation	249.0s (2 bridge agents)
Total	249.0s

💰 Value — sound

Hardens the trace/eval data plane across 43 files with fail-loud boundaries, concurrency safety, and NaN-resistant parsing — all surgical, additive, and test-gated; the one design-in-progress concern (custom promise-chain lock in FileSystemTraceStore where Mutex would suffice) is stylistic, not corr

What it does: Adds six capability layers: (1) ProductClient gains a centralized request() method with timeout support, res.ok gating, and requireArray contract enforcement so 4xx/5xx/malformed responses throw instead of silently returning empty arrays; (2) command-runner replaces the which-based hasBin with a cross-platform resolveBinOnPath that probes PATH + PATHEXT via accessSync, and discri
Goals it achieves: Eliminates silent data corruption paths: (a) HTTP errors no longer masquerade as legitimate empty result sets, (b) IO failures other than 'file not found' are surfaced instead of silently collapsed to null/empty, (c) malformed LLM chat responses are rejected before scoring, (d) concurrent NDJSON appends cannot collide or lose rows, (e) trace replay order is deterministic and patch rows always find
Assessment: This is a coherent, well-executed hardening pass. Every change is additive — no public signature was broken, no existing behavior was removed. Each fix comes with a regression test that fails on the old code. The changes follow the codebase's existing patterns: Mutex for the sandbox pool, AgentEvalError hierarchy for errors, isTransientLlmError shared between executor and llm-client. The com
Better / existing approach: One design-in-progress concern, not a ship-blocker: FileSystemTraceStore in src/trace/store.ts:197-226 implements a custom per-file promise-chain lock (Map<string, Promise<void>> with non-rejecting tail promises) to serialize appends. The codebase already has Mutex (src/concurrency.ts) used for the same purpose by LockedJsonlAppender (src/locked-jsonl-appender.ts) and `reference-repl
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🎯 Usefulness — sound

Comprehensive reliability/concurrency/scale hardening — every change fixes a specific defect with a regression test, zero public signature breaks, all reachable through existing callers or exported additive surface.

Integration: Every changed module is reachable: ProductClient → MetricTracker.getState (src/metrics.ts:193-198) and AgentDriver (src/driver.ts:171); executeScenario → BenchmarkRunner.run (src/benchmark.ts:40); AnalystRegistry → contract/analyze-runs.ts and campaign/drivers/trace-analyst.ts; FileSystemTraceStore → eval-campaign.ts and OTLP exporters; stuckLoopView/toolWasteView → exported pipelines consumed by
Fit with existing patterns: All changes follow the codebase's established patterns: the requireArray fail-loud matches the existing CaptureIntegrityError/runE2EWorkflow error-surfacing doctrine; the bounded-concurrency worker pool (judge-runner, PRM) is the identical pattern across both modules; sleep injection mirrors llm-client.ts:301; the two-pass load in FileSystemTraceStore fits the append-only NDJSON architec
Real-world viability: Each fix handles realistic edge conditions: ProductClient timeout via AbortController prevents hung harnesses; requireArray prevents a 500 error body from being parsed as []; CommandRunner distinguishes EACCES/EISDIR from ENOENT (a permission error masked as 'file absent' would silently skip critical artifacts); sandbox handle kills the whole process group so a runaway grandchild can't keep pi
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added src/executor.ts

     console.log(`    ${judgeName} retry ${attempt}/2 (waiting ${wait / 1000}s)`)

🟡 Cruft: magic number added src/executor.ts

     console.log(`    ${judgeName} retry ${attempt}/2 (waiting ${wait / 1000}s)`)

💰 Value Audit

🟠 FileSystemTraceStore implements custom promise-chain lock where Mutex already exists [duplication] ``

src/trace/store.ts:197-226 uses a custom Map<string, Promise<void>> with a non-rejecting tail-promise pattern to serialize per-file NDJSON appends. The codebase already has Mutex (src/concurrency.ts, exported at src/index.ts:1088) which provides the same serialization guarantee via runExclusive. LockedJsonlAppender (src/locked-jsonl-appender.ts) and reference-replay.ts both use Mutex for the exact same purpose (serializing JSONL appendFile calls). The appendLocked method's

🟡 isErrnoCode not exported despite reuse potential [against-grain] ``

src/command-runner.ts:169 defines a clean isErrnoCode(err, code) helper. At least 5 other files (src/trace-analyst/store-otlp.ts:443, src/trace/raw-provider-sink.ts:256, src/meta-eval/sentinel.ts:273, src/experiment-tracker.ts:314, src/eval-trace-store.ts:124) inline the same (err as NodeJS.ErrnoException).code === 'ENOENT' pattern with no abstraction. The PR doesn't export isErrnoCode or refactor those call sites. Not a correctness concern, but the pattern was worth unifying i

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260619T135451Z}

Witnessed defects from the integrity audit (each proven by running the real code, expected-vs-actual); refuted/intended candidates were dropped, not shipped. - otlp-to-run-records: span ordering, earliest/latest bounds, and wallMs used raw string comparison + Date.parse, which mis-ordered spans and zeroed wallMs for epoch-millis-string timestamps. Now epoch-aware via shared spanEpochMillis / compareSpanTime helpers (mirrors the store-otlp fix). - query.aggregateLlm: summed input/output/cached but omitted reasoningTokens (a defined LlmSpan field) — reasoning usage was invisible to cost/perf analysis. - analyst.test: replaced a vacuous expect(true).toBe(true) with a real settles-not-hangs assertion. Adds witness-backed regression tests for the timestamp + reasoning-token fixes.

tangletools

✅ Auto-approved PR — `09ff6713`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T14:13:29Z}

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	3 (2 low, 1 weak-concern)
Heuristic	0.0s
Duplication	0.0s
Interrogation	387.0s (2 bridge agents)
Total	387.0s

💰 Value — sound

Comprehensive substrate hardening across 47 files: fail-loud HTTP/FS/LLM boundaries, bounded concurrency pools preventing resource exhaustion, budget-split fairness, and de-dup/serialization correctness — all surgical, additive, and regression-tested. Ship.

What it does: Hardens the data-plane reliability, concurrency, and correctness of the agent-eval substrate across seven tranches: (1) fail-loud HTTP client — centralized request() with res.ok check, requireArray() replacing silent ?? [] masking, and AbortController timeout; (2) fail-loud FS operations — command-runner readFile/readDir narrow ENOENT-only catch, isErrnoCode helper, and in-process hasBin via PATH
Goals it achieves: Makes the agent-eval substrate reliable at scale: HTTP/LLM/FS boundaries fail loud instead of silently masking errors as empty results; concurrent fan-out is bounded so wide eval matrices don't exhaust PIDs/file-descriptors or trigger rate-limit storms; orphaned runs from sibling failures are properly finalized instead of leaking resources; the analyst budget is fairly split so skipped inputs don'
Assessment: This is an excellent change. Every fix is surgical — no public signatures changed, no behavior removed, no existing consumers broken. The pattern is consistent throughout: identify a silent-corruption or resource-leak class, add a loud guard with a typed error, and land a regression test that proves the old code failed. The approach follows the codebase's documented 'fail loud' doctrine (CLAUDE.md
Better / existing approach: none — this is the right approach. Two minor observations that don't rise to concerns worth flagging: (1) the cursor-pool pattern is duplicated verbatim across eval-campaign.ts:388, judge-runner.ts:83, matrix/runner.ts:138, prm/inference.ts:41 and :82 — a future low-priority cleanup could extract a boundedPool(concurrency, items, worker) into src/concurrency.ts alongside the existing Mutex, but
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🎯 Usefulness — sound

Thorough, well-structured hardening pass across the substrate data plane — every change is additive, correctly integrated into the existing architecture, backed by regression tests, and consumes/wires into the established codebase grain without competing patterns.

Integration: All changed surfaces are reachable. executeScenario is called by Benchmark (src/benchmark.ts:40); AnalystRegistry is the central analyst orchestration primitive used by campaign/analyst-surface.ts, examples/, and the contract layer; FileSystemTraceStore is the default trace store for campaigns (src/campaign/run-campaign.ts:72) and is consumed by 30+ modules (prm, rl, builder-eval, meta-eval, pipel
Fit with existing patterns: All changes follow the established substrate pattern: primitives exported via index barrel, dependency on shared abstractions (TraceStore interface, Analyst types, TCloud), no competing implementations. The ProductClient.request() centralizes what was previously N ad-hoc fetch calls each lacking timeout/ok-checks — consistent with the fail-loud doctrine in CLAUDE.md. The FileSystemTraceStore appen
Real-world viability: Changes address real failure modes: argHash string-return prevents stuck-loop de-dup from breaking on undefined/null/function args (verified by query.test.ts:22-45); FileSystemTraceStore append mutex prevents read-modify-write races under concurrent writes (verified by store.test.ts); two-pass load handles patches arriving before base rows across rolled files (store.test.ts); CaptureIntegrityError
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added src/executor.ts

     console.log(`    ${judgeName} retry ${attempt}/2 (waiting ${wait / 1000}s)`)

🟡 Cruft: magic number added src/executor.ts

     console.log(`    ${judgeName} retry ${attempt}/2 (waiting ${wait / 1000}s)`)

🎯 Usefulness Audit

🟡 login() and chat() bypass the new request() timeout boundary [integration] ``

ProductClient.request() is documented as 'Single HTTP boundary for every JSON request' (src/client.ts:29), but login() (line 87) and chat() (line 121) still use raw fetch() without AbortController timeout. login() additionally lacks the centralized res.ok→throw pattern (has a manual body.user check instead). Both methods have legitimate reasons for divergence — login() needs redirect: 'manual' + set-cookie, chat() needs response body streaming — so this is not a defect, but the doc comment and a

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260619T142151Z}

…hardening (#258) Version-only bump of the trio (package.json + python pyproject + __init__). Ships everything merged since 0.93.0: - #253 data-plane reliability/concurrency/scale hardening - #254 paired-binary + coding-eval estimators (McNemar, risk-diff, Wilson, pass@k) - #255 trace-store append/load race + collision-free rollover - #256 McNemar power + required-N - #257 single source of truth for span-timestamp parsing

drewstone added 5 commits June 19, 2026 06:50

tangletools previously approved these changes Jun 19, 2026

View reviewed changes

style(trace): apply biome formatting to tranche-1 files

7944992

Format + import-organize the tranche-1 files that missed the pre-commit formatter; fixes the CI biome lint failure. No behavior change.

drewstone dismissed tangletools’s stale review via 7944992 June 19, 2026 13:30

tangletools previously approved these changes Jun 19, 2026

View reviewed changes

tangletools reviewed Jun 19, 2026

View reviewed changes

drewstone dismissed tangletools’s stale review via c9b8cc4 June 19, 2026 13:48

tangletools previously approved these changes Jun 19, 2026

View reviewed changes

tangletools reviewed Jun 19, 2026

View reviewed changes

drewstone dismissed tangletools’s stale review via 09ff671 June 19, 2026 14:13

tangletools approved these changes Jun 19, 2026

View reviewed changes

tangletools reviewed Jun 19, 2026

View reviewed changes

drewstone merged commit 4477db5 into main Jun 19, 2026
1 check passed

This was referenced Jun 19, 2026

fix(trace-store): close append/load race + collision-free rollover #255

Merged

chore(release): 0.94.0 #258

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(substrate): data-plane reliability, concurrency & scale hardening#253

fix(substrate): data-plane reliability, concurrency & scale hardening#253
drewstone merged 8 commits into
mainfrom
harden/substrate-reliability

drewstone commented Jun 19, 2026

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

drewstone commented Jun 19, 2026

What

Verification

Tranches

Method

Residuals (follow-ups, out of scope here)

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved PR — 1c91d258

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved PR — 79449925

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

🟡 Value Audit — sound-with-nits

💰 Value — sound-with-nits

🎯 Usefulness — sound

🔎 Heuristic Signals

💰 Value Audit

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved PR — c9b8cc4d

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

🟢 Value Audit — sound

💰 Value — sound

🎯 Usefulness — sound

🔎 Heuristic Signals

💰 Value Audit

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved PR — 09ff6713

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

🟢 Value Audit — sound

💰 Value — sound

🎯 Usefulness — sound

🔎 Heuristic Signals

🎯 Usefulness Audit

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

✅ Auto-approved PR — `1c91d258`

✅ Auto-approved PR — `79449925`

✅ Auto-approved PR — `c9b8cc4d`

✅ Auto-approved PR — `09ff6713`