fix(trace-store): close append/load race + collision-free rollover#255
Conversation
The remaining real FileSystemTraceStore defects from the integrity audit, each proven by a regression test that fails on the pre-fix code (verified by neutering the fix and watching the test fail): - append-during-load: memoize the index build and make append await an in-flight load before writing, so a row appended mid-load is mirrored into the completed index instead of being permanently lost behind the cached index - rollover naming: strictly-increasing stamp (max(now, last+1)) so two rollovers in the same millisecond get distinct files — a bare Date.now() let the second overwrite and lose the first - corrupt NDJSON now fails loud WITH file+line context instead of a bare SyntaxError that hides which row/corpus is bad
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 975bf837
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T17:08:02Z
tangletools
left a comment
There was a problem hiding this comment.
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 1 (1 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 89.0s (2 bridge agents) |
| Total | 89.0s |
💰 Value — sound
Fixes three latent race/collision/diagnostics bugs in FileSystemTraceStore — index-memoize + append-await, strictly-increasing rollover stamps, and actionable corrupt-NDJSON errors — each with a regression test proven to fail on the pre-fix code.
- What it does: Three targeted fixes to FileSystemTraceStore in src/trace/store.ts: (1) memoizes the lazy index build via
indexPromiseand makesappend()await an in-flight load before writing, closing a race where a row appended during a concurrent first-read was permanently invisible behind the cached index; (2) replaces bareDate.now()for rolled-file stamps with `Math.max(Date.now(), lastRolloverStamp + - Goals it achieves: Eliminates permanent data invisibility under concurrent read+write (the append/load race), prevents silent data loss from timestamp collisions during rapid rollover, and makes corrupt-NDJSON failures immediately actionable without grepping through files.
- Assessment: Clean, minimal, and in the codebase's grain. The
indexPromisememoization pattern already exists in the siblingOtlpFileTraceStore(src/trace-analyst/store-otlp.ts:152, 454-458) — this change unifiesFileSystemTraceStorewith that convention. The corrupt-diagnostics pattern mirrorsfileVerdictCache(src/verdict-cache.ts:119-126). The strictly-increasing stamp is novel but trivial — no exis - Better / existing approach: none — this is the right approach. The indexPromise memoization unifies with the pattern already in OtlpFileTraceStore (src/trace-analyst/store-otlp.ts:454-458). The corrupt-diagnostics pattern aligns with verdict-cache's parseCacheLine (src/verdict-cache.ts:119-126). No existing mechanism in the codebase provides collision-free rollover stamps or the append-await-load synchronization, and the imp
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🎯 Usefulness — sound
Three internal reliability fixes to FileSystemTraceStore (append-during-load race, same-ms rollover collision, actionable corrupt-NDJSON diagnostics) with zero API change — regressions tests proven real, all existing tests pass, no competing pattern exists.
- Integration: All three changes are private to FileSystemTraceStore (src/trace/store.ts:184). The class is the production NDJSON persistence implementation exported via src/trace/index.ts:12 into the public barrel (src/index.ts). It is consumed by downstream repos (agent-runtime, agent-knowledge); TraceEmitter (src/trace/emitter.ts:69) and the wire ingestion layer (src/wire/handlers.ts:280) both take a TraceSto
- Fit with existing patterns: The indexPromise ??= memoization (src/trace/store.ts:292) follows the same lazy-init pattern already used by appendLocks (src/trace/store.ts:202). The strictly-increasing rollover stamp (src/trace/store.ts:248) is a minimal, correct one-liner on the existing rollover logic. The enhanced NDJSON error message (src/trace/store.ts:355-359) preserves the existing fail-loud doctrine while adding diagnos
- Real-world viability: The append-during-load fix (await indexPromise before writing, src/trace/store.ts:218) serializes append-after-load correctly — the test (src/trace/store.test.ts:225-238) proves the row survives through both disk write and index mirror. The rollover fix (Math.max(Date.now(), last+1), src/trace/store.ts:248) is strictly increasing even under a mocked frozen clock — test (src/trace/store.test.ts:241
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: magic number added src/trace/store.test.ts
// Distinct stamps (1000, 1001) — a bare Date.now() would collide to one
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 89 | 92 | 89 |
| Confidence | 65 | 65 | 65 |
| Correctness | 89 | 92 | 89 |
| Security | 89 | 92 | 89 |
| Testing | 89 | 92 | 89 |
| Architecture | 89 | 92 | 89 |
Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.
🟡 LOW Inverse race (append before load) still not covered — src/trace/store.ts
Line 218:
if (this.indexPromise) await this.indexPromiseinappend()only covers the case where a load is already in flight when the append starts. The converse — append starts first (passes the check whileindexPromiseis undefined), writes to disk, and THEN abuildIndex()begins — can still skip the mirror ifbuildIndex()reads the file before the append completes. The row is not permanently lost (disk is source of truth for future instances), but it can be invisible to the current instance's index. This is a pre-existing limitation, not introduced by this PR.
🟡 LOW Rejected index build permanently memoized (no retry) — src/trace/store.ts
Line 292:
this.indexPromise ??= this.buildIndex()— oncebuildIndex()rejects, the rejected promise is stored permanently via??=(a rejected Promise is not nullish). The old code'sif (this.loaded && this.index)guard allowed retry on transient errors becauseloadedwas only set totrueat the end ofload(). In practice, index build errors come from corrupt NDJSON (permanent) or transient FS errors (vanishingly rare). The repo philosophy is 'fail loud', so this is acceptable. Consider wrapping in a try/catch that clearsthis.indexPromiseon failure to allow one retry after an explicitreset()or on the next read from a caller that handles the error.
🟡 LOW Rejected indexPromise permanently bricks the store (behavioral change from retryable to unrecoverable) — src/trace/store.ts
Lines 289-294: load() memoizes via
this.indexPromise ??= this.buildIndex(). If buildIndex() throws (e.g., corrupt NDJSON at line 352-359, or a transient fs.readFile EBUSY/ENOENT when a file is deleted between readdir at line 302 and readFile at line 343), indexPromise stays rejected forever. Every subsequent load()
tangletools · 2026-06-19T17:16:48Z · trace
…hardening (#258) Version-only bump of the trio (package.json + python pyproject + __init__). Ships everything merged since 0.93.0: - #253 data-plane reliability/concurrency/scale hardening - #254 paired-binary + coding-eval estimators (McNemar, risk-diff, Wilson, pass@k) - #255 trace-store append/load race + collision-free rollover - #256 McNemar power + required-N - #257 single source of truth for span-timestamp parsing
What
Two
FileSystemTraceStorereliability fixes surfaced by the integrity audit of #253 but landed after #253 had already squash-merged. Isolated tosrc/trace/store.ts(+ tests); no public signature changed.Fixes (each with a regression test verified to FAIL on the pre-fix code)
append()awaits an in-flight load before writing. A row appended while a load was mid-flight was previously skipped by the mirror (this.indexstill undefined) and made permanently invisible behind the cached index. Now it's written after the load completes and mirrored into the finished index.max(Date.now(), last+1)). A bareDate.now()let two rollovers in the same millisecond produce the same filename, so the secondrenameoverwrote (and lost) the first rolled file.SyntaxErrorthat hides which row/corpus is bad. (Still fail-loud per doctrine — just actionable.)Verification
tscclean,biome check srcclean, merges clean intomain.Provenance
These are the items I had flagged as deferred on #253's final review; they're pre-existing latent bugs (not regressions from #253), now closed.