feat(snapshot-skip B3): cold-start restoreSnapshotState skip gate + CRC verifier + metrics by bootjp · Pull Request #934 · bootjp/elastickv

bootjp · 2026-06-05T07:00:50Z

Summary

Implements Branch 3 of the cold-start snapshot-restore skip optimisation designed in PR #910, building on PR #915 (B2 — metaAppliedIndex plumbing). This is the user-visible perf win of the series: after this lands, the engine restores the FSM only when the on-disk state is actually behind the snapshot pointer; the common case (local restart with a fresh fsm.db) collapses to a header read + CRC verification, skipping the multi-GiB body restore that PR #909's HEALTH_TIMEOUT_SECONDS=300 band-aid was sized to absorb.

Reading order (6 commits, designed for one-at-a-time review)

#	commit	scope
1	`8312f4ba`	`raftengine.SnapshotHeaderApplier` two-phase interface (`ParseSnapshotHeader` + `ApplySnapshotHeader`)
2	`024eb43c`	`kvFSM` implements both phases + compile-time guard
3	`933dab86`	`applyHeaderStateOnSkip` (3-step CRC, mirrors `openAndRestoreFSMSnapshot`) + `restoreSnapshotState` skip gate
4	(folded into #3)	`fsmAlreadyAtIndex` + skip-gate wiring
5	(next)	metrics + INFO log: `raftengine.ColdStartObserver` + `monitoring.ColdStartMetrics` + `Registry.ColdStartObserver()`
6	`ce277e66`	9 new tests (skip-gate behaviour ×5 + CRC failure modes ×3 + kvFSM contract ×1)

(Commit 5 was merged into the metrics commit; the actual git log is 5 commits.)

What this does

restoreSnapshotState branches on decideSkipOutcome:

decision, have := decideSkipOutcome(fsm, tok.Index)
reportColdStart(obs, logger, decision, tok.Index, have)
if decision == coldStartSkip {
    return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C)
}
return openAndRestoreFSMSnapshot(...) // unchanged

decideSkipOutcome returns one of 5 outcomes:

outcome	trigger	action
`skip`	`LastAppliedIndex >= snap.Index`	header-only path
`execute`	`LastAppliedIndex < snap.Index`	full restore
`fallback_not_reader`	FSM doesn't implement `AppliedIndexReader`	full restore
`fallback_missing_meta`	meta key absent (pre-upgrade fsm.db)	full restore
`fallback_read_err`	`LastAppliedIndex` returned an error	full restore

The three fallback paths preserve the strictly-additive guarantee from the design doc §4: over-restoring is always strictly safer than skipping incorrectly. Errors from LastAppliedIndex do not abort cold start — they collapse to fallback.

CRC verification (mirrors `openAndRestoreFSMSnapshot`)

applyHeaderStateOnSkip runs the same three-step safety contract as the full restore path. Failure of any step propagates a typed error WITHOUT mutating FSM state:

Step	Check	Error
1	`info.Size() < fsmMinFileSize`	`ErrFSMSnapshotTooSmall`
2	`readFSMFooter` vs `tokenCRC`	`ErrFSMSnapshotTokenCRC`
3	full-body `crc32.TeeReader` vs footer	`ErrFSMSnapshotFileCRC`

The FSM's ApplySnapshotHeader (HLC ceiling + Stage 8a cutover) only fires after all three pass.

Metrics + INFO log

raftengine.ColdStartObserver interface lives in the parent raftengine package so monitoring can implement it without circular import. monitoring.Registry gains ColdStartObserver(). The engine receives it through OpenConfig.ColdStartObserver — nil disables metrics, the skip itself still runs.

Two Prometheus series:

elastickv_fsm_cold_start_restore_total{outcome,fallback_reason}
elastickv_fsm_cold_start_applied_index_gap{outcome}

Plus a structured zap.Info log on every cold-start with fsm_applied / snapshot_index / gap and (for fallback) the reason enum.

Design constraints honoured

§4 strictly-additive fallback: decideSkipOutcome collapses every uncertainty to a fallback variant; LastAppliedIndex errors do NOT abort cold start.
§5 two-phase seam (round-7): ParseSnapshotHeader reads the v1/v2 header from a caller-supplied reader and drains the rest; ApplySnapshotHeader is pure assignment. Splits the parse from the side-effect so the engine can verify the CRC between them.
§5 CRC mirroring (round-6): 3-step verification (size + footer-vs-tokenCRC + full-body-CRC) exactly matches openAndRestoreFSMSnapshot. The full-body CRC pass IS the slow part of this PR (~6s for a 6 GiB FSM at 1 GiB/s SSD read) — but still strictly cheaper than the restore it replaces, which also reads the file once AND writes a temp Pebble database via restoreBatchLoopInto.
§9 observability: three outcomes + gap label; the design's not_reporter label is named not_reader in the actual impl (matching the round-5 PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 rename).
Non-Goals respected: Engine.applySnapshot (engine.go:1641 InstallSnapshot hot path) is untouched. The TODO sentinel added in PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still names B3 as the follow-up site for the peer-after-InstallSnapshot bump — that's not in scope here either.

Test results

go vet ./internal/raftengine/etcd/ ./internal/raftengine/ ./kv/ ./store/ ./monitoring/   → exit 0
golangci-lint run ... (same packages)                                                     → 0 issues
go test ./internal/raftengine/etcd/ ./kv/ ./store/ ./monitoring/ -short                   → ok ~57s total
go test ./internal/raftengine/etcd/ -run 'TestSkipGate|TestApplyHeaderStateOnSkip'        → ok 0.028s

What this does NOT do

Does NOT change HEALTH_TIMEOUT_SECONDS=300. That's Branch 4, gated on production data showing steady-state skip rate ≥ 90%.
Does NOT wire Engine.applySnapshot to populate metaAppliedIndex for peers receiving InstallSnapshot. The TODO sentinel from PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still flags this; addressing it requires a separate design pass.
Does NOT add an idle-cluster integration test (ELASTICKV_RAFT_SNAPSHOT_COUNT=10 end-to-end). The unit tests cover the gate decision matrix; an integration test would prove the codex round-3 P2 scenario is closed end-to-end in production-like setup, which is most valuable as a follow-up after B3 has soaked.

Soak plan

Same as B2: deploy and observe elastickv_fsm_cold_start_restore_total{outcome="skipped"} ramp up across cold starts. The design target is steady-state skip rate ≥ 90%; the per-outcome label distribution and gap distribution are the primary signals.

After one release of soak, Branch 4 can lower HEALTH_TIMEOUT_SECONDS.

Refs

PR docs(design): idempotent FSM snapshot restore on cold start #910 — design doc (rounds 1-7)
PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 — Branch 2 plumbing (merged); this PR depends on metaAppliedIndex being populated
PR fix(ops/rolling-update): bump HEALTH_TIMEOUT_SECONDS default 60s -> 300s #909 — HEALTH_TIMEOUT_SECONDS band-aid this series eventually obviates

Summary by CodeRabbit

New Features
- Cold-start lifecycle callbacks added and wired through the raft engine to report skip/execute/fallback outcomes and applied-index gaps.
- Prometheus metrics for cold-start outcomes and applied-index gap.
- Snapshot-header read/apply contract introduced; KV FSM implements header apply and volatile-entry classification to support header-only skips and volatile-only replay on duplicates.
Tests
- Extensive unit/integration tests covering skip/execute/fallback decisions, header verification, volatile duplicate replay, and updated restore-call signatures.

Two-phase interface for the cold-start skip path to preserve the snapshot header state (HLC ceiling, Stage 8a cutover) the FSM's Restore would normally apply, without running the multi-GiB body restore. - ParseSnapshotHeader(r) — reads the v1/v2 header from a caller- supplied reader (wrapped in a crc32 TeeReader by the engine) and drains the rest so the wrapping CRC covers the full payload. Returns the parsed (ceiling, cutover) WITHOUT mutating FSM state. Errors propagate; no side-effect on error. - ApplySnapshotHeader(ceiling, cutover) — pure assignment of the verified header state. Engine calls this only after Parse returned and the crc32 hash matched the file footer. Splitting parse from apply lets the CRC verifier stay co-located with its private helpers in internal/raftengine/etcd (matching openAndRestoreFSMSnapshot's safety contract) while the v1/v2 header parser stays inside the kv package where kv.ReadSnapshotHeader already lives. Neither package imports the other in production. Interface only; implementation + skip gate land in subsequent commits. Refs PR #910 design §5 round-7 + post-B2 implementation series.

Implements the two-phase opt-in interface added in step 1. - ParseSnapshotHeader(r) — wraps r in a bufio.Reader, calls kv.ReadSnapshotHeader (handles v1/v2/headerless/unknown magic), drains remaining body bytes via io.Copy(io.Discard, br) so the engine's wrapping crc32 TeeReader covers every payload byte. Returns the parsed (ceiling, cutover) tuple WITHOUT mutating f.hlc or f.restoredCutover — that side-effect waits for phase 2 after the engine confirms the CRC. - ApplySnapshotHeader(ceiling, cutover) — pure assignment of the verified header state. Mirrors the two side-effects from Restore that are scoped to the header portion: SetPhysicalCeiling + restoredCutover. The drain pass is essential: without it, the engine's TeeReader would only hash the first ~26 bytes of the body (whatever ReadSnapshotHeader consumed) and the wrapping CRC check would not correspond to the file footer. With the drain, the CRC matches restoreAndComputeCRC's full-payload boundary in openAndRestoreFSMSnapshot. Added compile-time guard in fsm_applied_index_iface_check.go: var _ raftengine.SnapshotHeaderApplier = (*kvFSM)(nil) so a future rename or signature drift fails 'go build' immediately rather than silently degrading the skip optimisation. Tests: go vet ./kv/ + golangci-lint clean. Skip-gate consumer lands in subsequent commits. Refs PR #910 design §5 round-7 + B3 step 1.

…s 3+4) Wires the user-visible cold-start optimisation from the PR #910 design. After this commit the engine restores the FSM only when strictly necessary; the common case (local restart with a fresh fsm.db) collapses to a header read + CRC verification, skipping the multi-GiB body restore that PR #909's HEALTH_TIMEOUT_SECONDS=300 band-aid was sized to absorb. restoreSnapshotState now branches on fsmAlreadyAtIndex: if fsmAlreadyAtIndex(fsm, tok.Index) { return applyHeaderStateOnSkip(fsm, snapPath, tok.CRC32C) } return openAndRestoreFSMSnapshot(...) // unchanged fsmAlreadyAtIndex returns true ONLY when the FSM exposes raftengine.AppliedIndexReader AND LastAppliedIndex reports (have, true, nil) AND have >= want. Any uncertainty -- FSM doesn't opt in, store reports missing meta key, LastAppliedIndex returns an error -- collapses to false so the caller falls back to the full restore path. Strictly-additive guarantee (PR #910 §4): a wasteful restore is strictly safer than a stale-but-incorrect skip. applyHeaderStateOnSkip mirrors openAndRestoreFSMSnapshot's three- step safety contract exactly: 1. fsmMinFileSize -> ErrFSMSnapshotTooSmall 2. readFSMFooter vs tokenCRC -> ErrFSMSnapshotTokenCRC 3. crc32 TeeReader over the full payload -> ErrFSMSnapshotFileCRC Only after all three pass does the FSM's ApplySnapshotHeader run (pure assignment of ceiling + cutover). FSMs that do not implement raftengine.SnapshotHeaderApplier (legacy test fakes) still get the CRC verification but no header side-effects -- they have no header state to carry forward. Helpers extracted to keep cyclomatic complexity under 10: - verifyFSMSnapshotPrefix: steps 1 + 2 (size + footer-vs-token). - readSnapshotHeaderOrDrain: step 3's seam-vs-no-seam branch (delegates to setter.ParseSnapshotHeader OR drains io.Discard through the tee). Imports gained: 'hash/crc32' and the parent raftengine package (for the AppliedIndexReader + SnapshotHeaderApplier interfaces). Tests: go vet + golangci-lint clean across etcd/, kv/, store/; existing test suite still green (internal/raftengine/etcd 21.5s, kv 9.5s, store 23.9s). Refs PR #910 design §4 + §5 round-7; closes the user-visible cold-start side of the implementation series.

…(B3 step 5) Wires Prometheus metrics + structured INFO logs around the restoreSnapshotState skip decision so operators can verify the optimisation engages in production (design target: >= 90% skip rate in steady state after B2's soak phase). internal/raftengine/cold_start.go — new ColdStartObserver interface with three methods (RestoreSkipped / RestoreExecuted / RestoreFallback). Lives in the parent raftengine package so monitoring can implement it without circular import. Contract documents the strictly-additive fallback reason enum (not_reader / missing_meta / read_err) and the no-blocking requirement (cold- start critical path). internal/raftengine/etcd/engine.go — OpenConfig gains an optional ColdStartObserver field. nil disables metrics; the skip itself still runs. internal/raftengine/etcd/wal_store.go — restoreSnapshotState signature gains (obs, logger) parameters. The body switches from the bool fsmAlreadyAtIndex to a coldStartDecision enum (skip / execute / fallback-{not_reader,missing_meta,read_err}) returned by decideSkipOutcome. reportColdStart dispatches to the observer + emits a zap.Info log on every cold-start. - 'restoreSnapshotState skipped' with fsm_applied / snapshot_index / gap_ahead. The user-visible perf win logged once per restart. - 'restoreSnapshotState executed (FSM behind snapshot)' with gap_behind so operators can see how stale the FSM was. - 'restoreSnapshotState fallback to full restore' with reason so operators can chase 'why is the optimisation not engaging' in production. monitoring/cold_start.go — newColdStartMetrics + ColdStartObserver struct. Two series: elastickv_fsm_cold_start_restore_total{outcome,fallback_reason} elastickv_fsm_cold_start_applied_index_gap{outcome} (See doc-comments for the precise label semantics.) monitoring/registry.go — Registry gains ColdStartObserver() accessor. main.go (next commit, not in this PR) wires it to the engine via OpenConfig. Existing test call sites of restoreSnapshotState (wal_store_test.go * 8 + grpc_transport_test.go * 1) updated to pass (nil, nil) — tests opt out of metrics. The legacy bootstrap- path loadWalState recursion in wal_store.go:522 also passes nil because OpenConfig wiring has not yet reached that call. Tests + lint: go vet + golangci-lint clean across all touched packages; go test ./internal/raftengine/etcd/ ./kv/ ./store/ ./monitoring/ -short ok 55s total. Refs PR #910 design §9. B3 step 5 of the implementation series.

…tep 6) Nine new tests covering the Branch 3 skip gate end-to-end: skip-gate behaviour (5): - TestSkipGate_SkipsWhenFSMFreshEnough — applied >= snap.Index triggers the skip path; FSM body stays empty; observer records one RestoreSkipped with gap_ahead; ApplySnapshotHeader fires with the parsed ceiling. - TestSkipGate_ExecutesWhenFSMStale — applied < snap.Index runs the full restore; ApplySnapshotHeader does NOT fire; observer records one RestoreExecuted with gap_behind. - TestSkipGate_FallbackMissingMeta — appliedPresent=false falls back to full restore with reason='missing_meta'. - TestSkipGate_FallbackReadErr — LastAppliedIndex error collapses to fallback (NOT propagated) with reason='read_err'. - TestSkipGate_FallbackNotReader — dummyFSM (no AppliedIndexReader) falls back with reason='not_reader'. CRC failure modes (3): - TestApplyHeaderStateOnSkip_TruncatedFile — empty file triggers step 1 ErrFSMSnapshotTooSmall; FSM state unchanged. - TestApplyHeaderStateOnSkip_WrongTokenCRC — bad tokenCRC arg triggers step 2 ErrFSMSnapshotTokenCRC; FSM state unchanged. - TestApplyHeaderStateOnSkip_BodyCorruption — XOR one byte of the body in-place; step 1 + 2 pass but step 3 (full-body CRC) catches it via ErrFSMSnapshotFileCRC; FSM state unchanged. kvFSM contract (1): - TestSkipGate_KVFSMHeaderRoundTrip — production *kv.kvFSM satisfies SnapshotHeaderApplier; v1 header (EKVTHLC1 + ceilingMs BE) is parsed and applied; hlc.PhysicalCeiling() reflects the ceiling; RestoredCutover() is 0 for v1. Test doubles: - skipGateFSM implements StateMachine + AppliedIndexReader + SnapshotHeaderApplier; configurable via field assignment. - recordingObs implements raftengine.ColdStartObserver; records every callback for assertion. Test results: - go test ./internal/raftengine/etcd/ -run 'TestSkipGate|TestApplyHeaderStateOnSkip' -count=1 -> ok 0.028s - Full short test sweep across etcd/, kv/, store/, monitoring/ -> ok ~57s total. - go vet + golangci-lint clean across all touched packages. Closes B3 implementation series. Refs PR #910 design §4 / §5 / §9 test list.

coderabbitai · 2026-06-05T07:00:58Z

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 43 minutes and 8 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d8b6e9eb-7059-4241-8cc6-060f5f4eadfa

📥 Commits

Reviewing files that changed from the base of the PR and between 20f90a9 and 58db40c.

📒 Files selected for processing (3)

internal/raftengine/etcd/cold_start_volatile_replay_test.go
internal/raftengine/etcd/engine.go
internal/raftengine/etcd/wal_store.go

📝 Walkthrough

Walkthrough

This PR adds a cold-start skip-gate that can avoid full FSM snapshot restores by applying verified snapshot headers when safe, wires a ColdStartObserver through factory/engine/metrics, enforces advance-only applied-index updates, and includes Prometheus metrics plus comprehensive tests for skip/execute/fallback and CRC/header verification.

Changes

Cold-Start Snapshot Skip-Gate Feature

Layer / File(s)	Summary
Contract definitions: ColdStartObserver and SnapshotHeaderApplier `internal/raftengine/cold_start.go`, `internal/raftengine/statemachine.go`	`ColdStartObserver` declares RestoreSkipped/RestoreExecuted/RestoreFallback; `SnapshotHeaderApplier` and `VolatileEntryClassifier` are added as optional FSM seams.
kvFSM header parse/apply and volatile classifier `kv/fsm.go`, `kv/fsm_applied_index_iface_check.go`	`kvFSM` implements ParseSnapshotHeader/ApplySnapshotHeader and IsVolatileOnlyPayload; compile-time assertions ensure interface conformance.
Factory and engine config wiring `internal/raftengine/etcd/factory.go`, `internal/raftengine/etcd/engine.go`	Adds ColdStartObserver fields to FactoryConfig/OpenConfig and threads the observer into engine Open during factory creation.
Cold-start metrics and registry integration `monitoring/cold_start.go`, `monitoring/registry.go`	Adds `ColdStartMetrics` (outcome counter + applied-index gap gauge) and a `ColdStartObserver` implementation; Registry constructs and exposes the observer via `ColdStartObserver()`.
Application startup and test integration `main.go`, `main_bootstrap_e2e_test.go`, `main_bootstrap_test.go`, `multiraft_runtime_test.go`	Reorders startup to build the metrics registry before creating the raft factory and passes `metricsRegistry.ColdStartObserver()` into `newRaftFactory`; test call sites adjusted to the new signature.
Engine applied-index seeding & advance-only guard `internal/raftengine/etcd/engine.go`	Engine seeding now uses the cold-start effective applied index, `setApplied` only advances strictly forward, and duplicate normal entries are routed to drop-or-replay logic for volatile payloads.
WAL/snapshot restore and skip-gate implementation `internal/raftengine/etcd/wal_store.go`	`loadWalState`/`restoreSnapshotState` return an effective applied override, decide skip/execute/fallback, verify token/footer/body CRCs for header-only skip, apply header via `SnapshotHeaderApplier` when supported, and report outcomes via the observer.
Skip-gate and header verification tests `internal/raftengine/etcd/wal_store_skip_gate_test.go`	Adds test doubles and tests validating skip/execute/fallback decisioning, WAL post-snapshot rules, header CRC/size verification failures, fallback reasons, and kvFSM header round-trip behavior.
Other test updates `internal/raftengine/etcd/wal_store_test.go`, `internal/raftengine/etcd/grpc_transport_test.go`, `internal/raftengine/etcd/encryption_test.go`, `internal/raftengine/etcd/cold_start_volatile_replay_test.go`, `monitoring/cold_start_test.go`	Updates restoreSnapshotState and applyNormalEntry call signatures, adds volatile-replay tests, and adds observer metric unit tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

bootjp/elastickv#910: Related cold-start skip-gate design and snapshot-header seam implementation.
bootjp/elastickv#747: Streaming-token snapshot changes that interact with the updated restoreSnapshotState signature and token CRC handling.
bootjp/elastickv#915: Related durable applied-index plumbing and AppliedIndexReader/Writer work that the skip-gate relies upon.

Poem

A rabbit hops through snapshots bright,
Skipping full restores when FSM's just right. 🐰
Headers checked and metrics spun,
Observers count the work we've done. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title clearly and specifically describes the main change: implementing a cold-start snapshot-restore skip optimization with CRC verification and metrics.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch impl/snapshot-skip-b3-skip-gate

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-06-05T07:01:04Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (ce277e6):

kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

bootjp · 2026-06-05T07:01:11Z

@claude review

Branch 3 of the PR #910 design — the user-visible cold-start perf win. Follows the merged Branch 2 (PR #915 — metaAppliedIndex plumbing). 6 commits, focused review pathways:

Review observation surface:

§4 strictly-additive fallback: decideSkipOutcome collapses every uncertainty (FSM not a reader, meta key missing, read error) to a fallback enum; LastAppliedIndex errors MUST NOT abort cold start; tests TestSkipGate_Fallback* cover all three reasons.
§5 two-phase seam (round-7): ParseSnapshotHeader(r io.Reader) (ceiling, cutover uint64, err error) + ApplySnapshotHeader(ceiling, cutover uint64). Engine wraps r in a crc32 TeeReader; FSM Parse reads + drains; engine verifies CRC; engine calls Apply ONLY on verification success. The body bytes are read but discarded (fsm.db already has equivalent state).
§5 CRC three-step (round-6): size + footer-vs-tokenCRC + full-body-CRC, byte-for-byte matching openAndRestoreFSMSnapshot's safety contract. Tests TestApplyHeaderStateOnSkip_* inject each failure mode and assert FSM state unchanged.
§6 nosync mode: not relevant here (B3 reads only; B2 handled the write side).
§9 observability: 3 outcomes + gap label + structured INFO log. raftengine.ColdStartObserver interface in the parent package so monitoring can implement without circular import.
Compile-time guards: var _ raftengine.SnapshotHeaderApplier = (*kvFSM)(nil) added next to the existing AppliedIndexReader/Writer guards — a rename or signature drift fails go build rather than silently degrading the skip to full-restore-always.
Non-Goals respected: Engine.applySnapshot untouched (peer-after-InstallSnapshot bump remains a B4 / follow-up; the TODO sentinel from PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still names it).
Caller audit (per /loop semantic-change rule): restoreSnapshotState signature changed (added obs + logger). All in-package callers updated; 9 test call sites updated to pass (nil, nil) — tests opt out of metrics.

chatgpt-codex-connector · 2026-06-05T07:01:15Z

To use Codex here, create a Codex account and connect to github.

claude · 2026-06-05T07:01:23Z

Claude finished @bootjp's task in 4m 16s —— View job

PR #934 Review — Branch 3: Cold-Start Snapshot-Restore Skip Gate

Todo list:

Read changed files and understand the implementation
Review decideSkipOutcome / strictly-additive fallback (§4)
Review two-phase seam: ParseSnapshotHeader + ApplySnapshotHeader (§5 round-7)
Review CRC three-step verification (§5 round-6)
Review observability (ColdStartObserver + metrics) (§9)
Review compile-time guards
Review test coverage
TLA+ spec divergence check (kv/fsm.go anchor)
Final summary

Overall assessment

The implementation is correct and the core design invariants are faithfully preserved. The three design constraints given in the trigger — strictly-additive fallback (§4), two-phase seam (§5/round-7), and CRC mirroring (§5/round-6) — are all correctly implemented. Four findings below, in descending severity.

Finding 1 — `ParseSnapshotHeader` error short-circuits before the full-body CRC is verified (design trade-off to document)

Location: internal/raftengine/etcd/wal_store.go:409-415

ceiling, cutover, err := readSnapshotHeaderOrDrain(setter, hasSetter, tee)
if err != nil {
    return err   // ← returns header error; h.Sum32() is never reached
}

if h.Sum32() != footer {
    return errors.Wrapf(ErrFSMSnapshotFileCRC, ...)
}

If ParseSnapshotHeader returns an error (e.g., ErrSnapshotHeaderUnknownMagic), the full-body CRC step is skipped. This is not a safety bug — the header error propagates, so the cold start correctly fails closed and FSM state is not mutated. However, it means a file that passes the step-2 (footer-vs-token) check but has a corrupt header surfaces as a header format error rather than ErrFSMSnapshotFileCRC.

This differs subtly from openAndRestoreFSMSnapshot where restoreAndComputeCRC runs the full-body CRC before fsm.Restore can even report a parse error. The skip path inverts the order: header parse first (inside the CRC pass), CRC checked after. The design deliberately separates parse from apply, so this ordering is intentional. A one-line comment explaining why early-return on parse error is acceptable here would help future readers avoid "fixing" this to match the full-restore flow.

Finding 2 — `have == want` boundary case is not tested

Location: internal/raftengine/etcd/wal_store_skip_gate_test.go

TestSkipGate_SkipsWhenFSMFreshEnough uses appliedIdx = 200, snapIndex = 100 (well ahead). TestSkipGate_ExecutesWhenFSMStale uses appliedIdx = 100, snapIndex = 200 (well behind). The default branch in decideSkipOutcome fires at have >= want, so have == want is a skip. This boundary case matters for two reasons:

The observer records gap_ahead = have - snapIndex = 0 — a zero gap is valid, but needs explicit confirmation.
The appliedIndexGap gauge would be set to 0, which Prometheus will emit and operators should expect.

[Fix this →](https://claude.ai/code?q=Add%20TestSkipGate_SkipsAtExactBoundary%20to%20internal%2Frightengine%2Fetcd%2Fwal_store_skip_gate_test.go%20on%20branch%20impl%2Fsnapshot-skip-b3-skip-gate.%20Use%20appliedIdx%20%3D%20snapIndex%20%3D%20100%2C%20verify%20restoreSnapshotState%20returns%20nil%2C%20fsm.bodyBytes%20is%20empty%2C%20fsm.restoredHeader%20is%20true%2C%20and%20obs.skipped%20%3D%20%5B%5D uint64%7B0%7D&repo=bootjp/elastickv)

Finding 3 — `RestoreFallback` discards `snapIndex`; the `appliedIndexGap` metric emits nothing for fallback — confirm intentional

Location: monitoring/cold_start.go:87-93

func (o *ColdStartObserver) RestoreFallback(snapIndex uint64, reason string) {
    _ = snapIndex    // ← explicitly discarded
    if o == nil || o.metrics == nil {
        return
    }
    o.metrics.restoreTotal.WithLabelValues("fallback", reason).Inc()
    // no appliedIndexGap update
}

This is correct per the design ("store could not authoritatively name a value"). But operators watching only elastickv_fsm_cold_start_applied_index_gap may see a missing data point and confuse it with a broken metric rather than an expected fallback. A Prometheus absente() alert on outcome=~"skipped|executed" would be the right operational guidance. Consider adding this to the runbook or alerting notes, not to the code.

Finding 4 — TLA+ spec: no divergence

kv/fsm.go anchor (applyHLCLease / SetPhysicalCeiling):

The new ApplySnapshotHeader calls:

if f.hlc != nil && ceiling > 0 {
    f.hlc.SetPhysicalCeiling(int64(ceiling))
}

This is a cold-start-only path, invoked before the Raft event loop starts. It sets the same ceiling that Restore → ReadSnapshotHeader → SetPhysicalCeiling would have set in the full-restore path. The HLC-4 preconditions are unaffected:

(i) bounded skew — ceiling is read from the persisted snapshot, not the local wall clock.
(ii) logical-counter handoff on leader change — not applicable at cold start; leader election has not happened yet.
(iii) commit-time ceiling fence — the skip path issues no writes.

No OCC, MVCC visibility, or routing-catalog invariants are touched. No spec update required.

Minor notes (no action required)

//nolint:mnd on 1<<20 in ParseSnapshotHeader (kv/fsm.go): Correct given the cross-package constant boundary — fsmRestoreReadAhead lives in internal/raftengine/etcd which kv cannot import. The inline literal matches the value. ✅
fallback_reason label is present but "" for skipped/executed outcomes — This is idiomatic Prometheus (sparse label rather than absent label) and avoids cardinality explosion. Consistent with conventions used elsewhere in the codebase. ✅
Seek-back symmetry: applyHeaderStateOnSkip correctly calls file.Seek(0, io.SeekStart) after verifyFSMSnapshotPrefix leaves the file cursor after the footer, mirroring openAndRestoreFSMSnapshot:294. ✅
bootstrapNewCluster / persistBootState pass nil observer — correct; the bootstrap snapshot restore is a no-op for fresh clusters, and the comment in persistBootState explicitly explains why no observer is needed. ✅
monitoring/hlc_test.go alignment-only diff (column alignment) — no logic change. ✅

Summary

Core correctness is solid. Finding 1 (header-error vs CRC-error ordering) is a design consequence worth documenting inline. Finding 2 (boundary test at have == want) is the only gap in coverage. Finding 3 is an operational note. No spec divergence.

gemini-code-assist

Code Review

This pull request introduces a cold-start snapshot-restore skip-gate optimization. It allows skipping the restoration of a multi-GiB snapshot body if the on-disk state machine (FSM) is already as fresh as or fresher than the snapshot. The changes include the definition of the ColdStartObserver and SnapshotHeaderApplier interfaces, integration into the WAL loading and FSM restore paths, implementation of Prometheus metrics for monitoring skip-gate outcomes, and comprehensive unit tests. There are no review comments, so I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce277e6667

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-05T07:06:59Z

+	// nil disables metrics; the skip itself still runs. See
+	// docs/design/2026_06_02_idempotent_snapshot_restore.md §9 and
+	// internal/raftengine/cold_start.go for the contract.
+	ColdStartObserver raftengine.ColdStartObserver


Plumb the cold-start observer through the factory

In the production startup path this observer is never populated: main.go creates monitoring.NewRegistry(...), but engines are opened through internal/raftengine/etcd.Factory.Create, and that method builds OpenConfig without a ColdStartObserver field because raftengine.FactoryConfig has no way to carry it. As a result loadWalState always receives obs == nil in normal server runs, so the new cold-start counters/gauges remain empty even when skips/executions/fallbacks happen. Please thread Registry.ColdStartObserver() through the factory config/open call.

Useful? React with 👍 / 👎.

PR #934 round-1 codex P2: in production, monitoring.NewRegistry's ColdStartObserver was never reaching the engine because internal/raftengine/etcd.Factory.Create built OpenConfig solely from the engine-agnostic raftengine.FactoryConfig (which has no place to carry the observer). The Branch 3 cold-start counters/gauges therefore stayed at zero in real runs even when skips/executions/fallbacks happened. Thread it via the engine-specific etcd.FactoryConfig (already holds TickInterval/HeartbeatTick/etc.) so the change is local to the etcd package and main.go — buildShardGroups, buildRuntimeForGroup, and the shared raftengine.FactoryConfig stay untouched. Wiring: - etcd.FactoryConfig gains ColdStartObserver raftengine.ColdStartObserver - Factory.Create copies it into OpenConfig{} - newRaftFactory in main.go now takes the observer arg and passes metricsRegistry.ColdStartObserver() through - moved the factory construction below metricsRegistry creation so the observer is available at factory build time Tests: - 3 existing newRaftFactory callers updated to pass nil (no metrics needed in those test scopes) - TestSkipGate_*/TestApplyHeaderStateOnSkip_*/TestSkipGate_KVFSMHeaderRoundTrip (9 tests) still pass — they exercise restoreSnapshotState directly with a recording observer test double, unaffected by the factory plumbing Verified: go vet ./..., golangci-lint run ./internal/raftengine/etcd/... ./monitoring/... ./internal/raftengine/... ./., and go test -count=1 ./internal/raftengine/etcd/ all clean.

bootjp · 2026-06-05T08:17:40Z

@claude review

Round-1 codex P2 addressed (commit 320d06e):

monitoring.Registry.ColdStartObserver() now flows through to OpenConfig.ColdStartObserver in production:

etcd.FactoryConfig gained ColdStartObserver raftengine.ColdStartObserver
Factory.Create copies it into the OpenConfig{} literal at factory.go:68
main.go reorders so metricsRegistry is built before newRaftFactory, then passes metricsRegistry.ColdStartObserver() in
Three newRaftFactory callers in tests pass nil (those scopes do not exercise metrics)

Kept the change inside the engine-specific etcd.FactoryConfig rather than the shared raftengine.FactoryConfig to avoid touching buildShardGroups, buildRuntimeForGroup, and unrelated test wiring.

Verified: go vet ./..., golangci-lint run (0 issues across raftengine/etcd, monitoring, raftengine, root), go test ./internal/raftengine/etcd/ green, all 9 TestSkipGate_* / TestApplyHeaderStateOnSkip_* still pass.

github-actions · 2026-06-05T08:17:47Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (320d06e):

kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

claude · 2026-06-05T08:17:52Z

Claude finished @bootjp's task in 2m 33s —— View job

PR #934 Round-2 Review — ColdStartObserver factory wiring (commit `320d06e`)

Todo list:

Load tools
Read changed files (factory.go, main.go, engine.go, cold_start.go, registry.go)
Verify ColdStartObserver flows correctly end-to-end in production path
Check test callers pass nil cleanly
Re-verify Round-1 findings status (boundary test, header-error comment)
Post final review

Codex P2 — Closed correctly ✓

The full wiring chain is intact:

Step	Location	Verified
`NewRegistry` builds `coldStartObs`	`monitoring/registry.go:57`	✓
`Registry.ColdStartObserver()` returns it	`monitoring/registry.go:65-70`	✓
`metricsRegistry` constructed before factory call	`main.go:323` vs `main.go:331`	✓
`newRaftFactory` takes `coldStartObs raftengine.ColdStartObserver`	`main.go:53`	✓
`FactoryConfig.ColdStartObserver` carries it	`factory.go:26`	✓
`Factory.Create` copies into `OpenConfig{ColdStartObserver: ...}`	`factory.go:68`	✓
`Open` passes `cfg.ColdStartObserver` to `loadWalState`	`wal_store.go:43`	✓
`loadWalState` → `restoreSnapshotState` → `reportColdStart`	`wal_store.go:135, 264`	✓
`reportColdStart` guards every dispatch with `obs != nil`	`wal_store.go:329,340,354`	✓

Test callers: main_bootstrap_test.go:94, main_bootstrap_e2e_test.go:354, multiraft_runtime_test.go:54,109 all pass nil correctly — tests don't exercise metrics.

Bootstrap recursive path (wal_store.go:527): passes nil with an explanatory comment — this is the fresh-bootstrap path that runs before OpenConfig wiring, and a fresh-bootstrap restore is a no-op anyway. Correct.

Nil-interface safety: reportColdStart guards with obs != nil (interface-level check), which is the right check for an interface type. *ColdStartObserver methods additionally check o == nil for the concrete receiver, providing defence-in-depth if the observer is ever passed as a non-nil interface wrapping a nil pointer. ✓

One minor note (no action required)

factory.go:24 and main.go:325-330 both reference "PR #934 round-1 codex P2". Per the codebase's comment convention, PR references in inline comments rot as the codebase evolves; the stable reference is docs/design/2026_06_02_idempotent_snapshot_restore.md §9. The information is accurate now but won't age well. Suggest replacing with the design-doc reference on cleanup, though it's not blocking.

Round-1 findings status

Finding	Status
F1 — document header-error-before-CRC ordering in `applyHeaderStateOnSkip`	Open (minor, documentation only)
F2 — add `TestSkipGate_SkipsAtExactBoundary` (`have == want`, gap=0)	Open (test completeness)
F3 — operational note on `appliedIndexGap` absence for fallback	Acknowledged, no code change needed
F4 — TLA+ spec: no divergence	Confirmed
Codex P2 — ColdStartObserver not plumbed in production path	Closed by `320d06e` ✓

Overall

P2 is closed correctly. The change is minimal and surgical — only FactoryConfig, Factory.Create, and the main initialisation order needed touching, which is exactly the right scope. No new issues introduced. The two remaining open findings (F1 comment, F2 boundary test) are cosmetic/test-completeness items that don't affect correctness or safety.

chatgpt-codex-connector · 2026-06-05T08:17:53Z

To use Codex here, create a Codex account and connect to github.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 320d06efb8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-05T08:23:26Z

+		decision, have := decideSkipOutcome(fsm, tok.Index)
+		reportColdStart(obs, logger, decision, tok.Index, have)
+		if decision == coldStartSkip {
+			return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C)


Prevent skipping when WAL will replay newer entries

When the FSM has applied entries after the latest persisted snapshot (the normal interval between snapshots, since metaAppliedIndex advances on each Apply), LastAppliedIndex() > tok.Index takes this skip path even though Open still initializes the raft applied floor from LocalSnap.Metadata.Index and openAndReadWAL reloads the post-snapshot entries. Those entries are then replayed into a Pebble store that already contains them; transaction replays can hit OCC conflicts before kvFSM observes their commit timestamps, leaving the restarted node’s HLC below timestamps already present on disk. The skip needs to advance/suppress replay through have, or only skip when there are no post-snapshot WAL entries to replay.

Useful? React with 👍 / 👎.

… (codex P1 #934) When the FSM has applied entries past the latest persisted snapshot (the normal interval between snapshots — metaAppliedIndex advances on each Apply), the prior skip path fired on LastAppliedIndex() > tok.Index even though Open still initializes raft from LocalSnap.Metadata.Index and openAndReadWAL replays the post-snapshot WAL entries. Those entries (tok.Index+1 .. have) would re-apply onto a Pebble store that already contains them, hitting OCC conflicts in the transaction commit path before kvFSM observes the commit timestamps, and leaving the HLC ceiling below timestamps already on disk. Fix: reorder loadWalState so the WAL is opened BEFORE the skip-gate decision. Compute lastWalIndex (max entry Index in the WAL tail, or snapshot.Metadata.Index when empty) and gate skip on 'have >= lastWalIndex' instead of 'have >= tok.Index'. When the WAL still carries entries beyond have, the gate falls through to the full-restore path, which then runs through the normal apply path without double-applies. Signature: restoreSnapshotState gains a lastWalIndex uint64 parameter. reportColdStart now takes (snapIndex, target, have) so the structured-log fields capture both 'snapshot_index' (original tok.Index for operator context) and 'last_wal_index' (the threshold actually used), and gap_ahead / gap_behind are computed against the threshold. Caller audit (signature change): - restoreSnapshotState production caller: loadWalState only. Updated. - Test call sites in wal_store_test.go (8), wal_store_skip_gate_test.go (5), grpc_transport_test.go (1) all updated to pass snap.Metadata.Index as the third argument — preserves the original test semantics where the threshold equals the snapshot index. - reportColdStart is private to this file; single call site updated. Regression: TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries - snap.Index=100, fsm.applied=150, lastWalIndex=200; the skip path would previously fire (have>snapIndex) but must now execute because have<lastWalIndex. Verifies fsm.bodyBytes (full restore ran), observer.executed=[lastWalIndex-have], and no skip recorded. Observer interface unchanged — RestoreSkipped/RestoreExecuted still take (snapIndex, have), but the value passed for 'snapIndex' is now the threshold (lastWalIndex). This keeps the metric semantically meaningful as 'gap relative to where the WAL replay would advance to'.

github-actions · 2026-06-07T06:35:01Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (57467ed):

kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

chatgpt-codex-connector · 2026-06-07T06:35:08Z

To use Codex here, create a Codex account and connect to github.

bootjp · 2026-06-07T06:35:14Z

@claude review

Addresses codex P1 #934 (internal/raftengine/etcd/wal_store.go) — cold-start skip gate would re-apply post-snapshot WAL entries.

Bug: when FSM applied > tok.Index (normal interval between snapshots — metaAppliedIndex advances each Apply), the skip path fired on have >= tok.Index. But Open still initializes raft's applied floor from LocalSnap.Metadata.Index and openAndReadWAL replays the post-snapshot WAL entries. Those entries (tok.Index+1 .. have) re-applied onto a Pebble store that already contained them → OCC conflicts in transaction commit + HLC ceiling left below timestamps already on disk.

Fix: reorder loadWalState so the WAL is opened BEFORE the skip-gate. Compute lastWalIndex (max entry Index in the WAL tail, or snapshot.Metadata.Index when empty) and gate skip on have >= lastWalIndex. When the WAL still carries entries beyond have, the gate falls through to full restore which runs through the normal apply path without double-applies.

Caller audit: restoreSnapshotState gains lastWalIndex uint64 param. Single production call site (loadWalState) + 14 test call sites updated. reportColdStart extended to log both snapshot_index (operator context) and last_wal_index (actual threshold).

Regression: TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries — snap=100, applied=150, lastWalIndex=200; would previously skip (have>snapIndex) but must execute because have<lastWalIndex.

go test + golangci-lint clean on ./internal/raftengine/etcd/.

claude · 2026-06-07T06:35:26Z

Claude finished @bootjp's task in 9m 47s —— View job

PR #934 Round-3 Review — Gate cold-start skip on `lastWalIndex` (codex P1, commit `57467ed`)

Todo list:

Load tools
Read wal_store.go (skip gate + loadWalState reorder)
Read wal_store_skip_gate_test.go (new regression test)
Verify fix correctness (WAL open-before-skip, lastWalIndex computation)
Check edge cases (empty WAL, exact boundary, WAL == snapIndex)
Verify reportColdStart extension (snapshot_index + last_wal_index logging)
TLA+ spec: kv/fsm.go anchor — no new HLC/OCC divergence
Post review

Overall assessment

The fix is correct. The gate change from have >= tok.Index to have >= lastWalIndex closes a genuine invariant violation. The underlying assumption is sound: kvFSM.Apply is fully idempotent — transaction commits via commitApplyStartTS + applyCommitWithIdempotencyFallback (kv/fsm.go:956,994), and raw writes via the strict ts > startTS conflict check (store/lsm_store.go:1305) mean re-delivering WAL entries tok.Index+1..lastWalIndex after a skip causes no OCC conflicts. Three findings below.

Finding 1 — `ColdStartObserver` interface docstring stale after semantic change (Minor)

Locations: internal/raftengine/cold_start.go:12-24, monitoring/cold_start.go:37-41,63-74

The interface was added when the skip gate compared against tok.Index. Now RestoreSkipped / RestoreExecuted receive target = lastWalIndex, not snapshot.Metadata.Index. Three places still describe the gap as relative to the snapshot pointer:

internal/raftengine/cold_start.go (parameter name + doc):

// RestoreSkipped: `gap = haveAppliedIndex -
//     snapshot.Metadata.Index` (how far ahead the live store was).
RestoreSkipped(snapIndex, haveAppliedIndex uint64)  // ← "snapIndex" is misleading; value is lastWalIndex

monitoring/cold_start.go (Prometheus Help string, line 37):

Help: "…outcome=skipped means the gate fired (LastAppliedIndex >= snapshot.Index)…"
// Should be: LastAppliedIndex >= lastWalIndex (WAL tail)

monitoring/cold_start.go (method comments, lines 63-74):

// RestoreSkipped records a successful skip. gap = have - snapIndex
// RestoreExecuted …gap = snapIndex - have
// Both are correct in arithmetic, but "snapIndex" is now the WAL tail, not the snapshot pointer.

Operators reading the Prometheus metric description will infer "distance from snapshot" when the true meaning is "distance from WAL tail." When lastWalIndex >> tok.Index, the gap is materially smaller than what the description implies.

Fix this →

Finding 2 — Regression test covers execute path but not the skip path with non-empty WAL (Minor)

Location: internal/raftengine/etcd/wal_store_skip_gate_test.go

The new TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries (snap=100, applied=150, lastWalIndex=200) correctly pins the bug scenario. However the skip path is still only tested with an empty WAL tail (lastWalIndex = snap.Metadata.Index in TestSkipGate_SkipsWhenFSMFreshEnough).

The common production path — WAL has entries AND the skip fires — is untested:

snap.Index=100, applied=200, lastWalIndex=200   ← WAL entries 101..200 exist; have≥lastWalIndex → skip

This should verify: applyHeaderStateOnSkip ran, fsm.bodyBytes is empty, obs.skipped = [200-200=0]. The gap-zero case also exercises have-target=0 in reportColdStart (the underflow-safe path).

(This is the same test requested in Round-1 F2, now with a clearer motivation since the skip path with non-empty WAL is the primary production benefit of the fix.)

Fix this →

Finding 3 — Correct but worth noting: empty-WAL and non-empty-WAL skip gates are now semantically unified

loadWalState:150-153 — lastWalIndex initialises to snapshot.Metadata.Index and is overridden only when WAL entries are present:

lastWalIndex := snapshot.Metadata.Index
if n := len(entries); n > 0 && entries[n-1].Index > lastWalIndex {
    lastWalIndex = entries[n-1].Index
}

When entries are empty, lastWalIndex = tok.Index, which means restoreSnapshotState(…, tok.Index, …) — identical to the pre-P1 semantics for the zero-WAL case. ✓ Existing test semantics preserved.

WAL close on error path — clean ✓

The new error path at lines 154-162 correctly closes the WAL handle when restoreSnapshotState fails. This matches the existing pattern used after newMemoryStorage failure at line 170-176. No resource leak. ✓

TLA+ spec — no divergence ✓

kv/fsm.go anchor: the changes in this commit are confined to loadWalState / restoreSnapshotState / decideSkipOutcome / reportColdStart in wal_store.go. No HLC issuance path, OCC commit-ts ordering, MVCC visibility, or routing-catalog logic is touched. ApplySnapshotHeader and ParseSnapshotHeader are unchanged. HLC-4 preconditions (i)(ii)(iii) remain intact. No spec update required.

Round summary

Finding	Severity	Status
F1 — `ColdStartObserver` docstrings + Prometheus Help stale (`snapIndex` → WAL tail)	Minor	Open
F2 — skip path with non-empty WAL untested (`have=lastWalIndex>tok.Index`)	Minor	Open
F3 — WAL close on error path	n/a	✓ Clean
TLA+ spec divergence	n/a	✓ None
Codex P1 core fix	—	✓ Correct

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 57467ed085

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T06:37:42Z

+		decision, have := decideSkipOutcome(fsm, lastWalIndex)
+		reportColdStart(obs, logger, decision, tok.Index, lastWalIndex, have)
+		if decision == coldStartSkip {
+			return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C)


Advance Raft's applied floor when skipping the restore

When a persisted snapshot is at index S and the FSM has already applied through the WAL tail T, this new condition selects the skip path, but Open still initializes Engine.applied and the RawNode applied floor from LocalSnap.Metadata.Index (S). Consequently Raft emits the committed entries S+1..T again and applyCommitted replays them against the already-advanced Pebble state, which can produce OCC failures or repeat other non-idempotent FSM effects. The new have >= lastWalIndex gate is fresh evidence beyond the earlier review: it guarantees that every post-snapshot WAL entry is already represented in the FSM without adding any corresponding mechanism to suppress its replay.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T06:37:42Z

+	case coldStartSkip:
+		if obs != nil {
+			obs.RestoreSkipped(target, have)


Report snapshot-relative gaps to the observer

Pass snapIndex here rather than target: ColdStartObserver.RestoreSkipped defines its first argument as the snapshot index, and the Prometheus implementation calculates have - snapIndex, but this call supplies lastWalIndex. Whenever the snapshot trails the WAL, the exported gauge therefore reports distance from the WAL tail (often zero) while its help text claims distance from the snapshot pointer; the executed callback below has the same argument mismatch.

Useful? React with 👍 / 👎.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

internal/raftengine/etcd/wal_store_skip_gate_test.go (3)

85-95: ⚡ Quick win

Misleading parameter names in test double.

The recordingObs methods name the first parameter snapIndex, but it receives target (lastWalIndex) from the actual implementation (see reportColdStart in wal_store.go:354-386). This is confusing because target and snapIndex can differ—when the WAL carries post-snapshot entries, target > snapIndex. The gap calculations happen to work because the tests pass snap.Metadata.Index as lastWalIndex, making them equal, but the misleading names make the test harder to maintain.

♻️ Rename parameters to match the interface contract

-func (o *recordingObs) RestoreSkipped(snapIndex, have uint64) {
+func (o *recordingObs) RestoreSkipped(target, have uint64) {
 	o.mu.Lock()
 	defer o.mu.Unlock()
-	o.skipped = append(o.skipped, have-snapIndex)
+	o.skipped = append(o.skipped, have-target)
 }
 
-func (o *recordingObs) RestoreExecuted(snapIndex, have uint64) {
+func (o *recordingObs) RestoreExecuted(target, have uint64) {
 	o.mu.Lock()
 	defer o.mu.Unlock()
-	o.executed = append(o.executed, snapIndex-have)
+	o.executed = append(o.executed, target-have)
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/wal_store_skip_gate_test.go` around lines 85 - 95,
Rename the parameters in the recordingObs test double so they reflect the
interface contract: change the first parameter name from snapIndex to target (or
lastWalIndex) in both RestoreSkipped and RestoreExecuted to match how
reportColdStart in wal_store.go calls them; update any internal calculations or
variable names in RestoreSkipped and RestoreExecuted to use target and snapIndex
appropriately (e.g., compute gaps as have - target or target - have only where
semantically correct) so the test double’s signature and names align with the
real invocation and avoid confusion.

117-119: 💤 Low value

Extract repeated header length constant.

The header length 16 appears in multiple tests (here and line 327) and in the test double (line 59). Consider extracting it as a package constant for maintainability.

♻️ Extract constant

Add near the top of the file:

const (
	testV1HeaderLen = 16 // EKVTHLC1 (8 bytes) + ceiling BE uint64 (8 bytes)
)

Then replace the literals:

-	payload := make([]byte, 16)
+	payload := make([]byte, testV1HeaderLen)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/wal_store_skip_gate_test.go` around lines 117 - 119,
Extract the repeated header length literal by adding a package-level constant
(e.g., testV1HeaderLen = 16) near the top of
internal/raftengine/etcd/wal_store_skip_gate_test.go and replace all literal
uses (for example the make([]byte, 16) in the payload construction, any
copy/slice ops assuming 16 bytes, and the test double referenced around the
earlier helper) with that constant; update references such as the payload
creation (payload := make([]byte, 16)), any index/slice bounds that assume the
16-byte header, and the test double setup to use testV1HeaderLen so the header
size is maintained in one place.

293-315: 💤 Low value

Consider adding clarifying comment for intentional corruption.

The bit flip at line 307 is intentional test corruption, which is clear from context, but a brief comment would make the intent immediately obvious to future readers.

📝 Add clarifying comment

 	var b [1]byte
 	_, err = f.ReadAt(b[:], 0)
 	require.NoError(t, err)
+	// Flip a bit to simulate corruption
 	b[0] ^= 0x01
 	_, err = f.WriteAt(b[:], 0)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/wal_store_skip_gate_test.go` around lines 293 - 315,
Add a short clarifying comment immediately above the intentional bit-flip (the
line with b[0] ^= 0x01) in TestApplyHeaderStateOnSkip_BodyCorruption to state
that the mutation is deliberate test corruption to ensure applyHeaderStateOnSkip
detects CRC mismatch; reference the variables crc, path and the expectation
ErrFSMSnapshotFileCRC and note that FSM state must not change
(fsm.restoredHeader) so reviewers understand the intent.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/raftengine/etcd/wal_store.go`:
- Around line 352-369: In reportColdStart, the ColdStartObserver callbacks are
called with the wrong first argument (passing target/lastWalIndex); change the
calls in reportColdStart to pass snapIndex as the first parameter so
RestoreSkipped and RestoreExecuted are invoked as (snapshotIndex,
haveAppliedIndex) — update the obs.RestoreSkipped(...) and
obs.RestoreExecuted(...) invocations in the reportColdStart function
accordingly.
- Around line 290-295: The current code calls reportColdStart before performing
the actual restore/skip, which records metrics even if applyHeaderStateOnSkip or
openAndRestoreFSMSnapshot fail; move the reportColdStart call so it runs only
after the chosen path successfully completes: call decideSkipOutcome(fsm,
lastWalIndex) to get decision/have, then perform either
applyHeaderStateOnSkip(fsm, fsmSnapPath(...), tok.CRC32C) or
openAndRestoreFSMSnapshot(...), check their returned error is nil, and only then
invoke reportColdStart(obs, logger, decision, tok.Index, lastWalIndex, have);
ensure both branches (skip and restore) return/report only on success and
propagate errors unchanged.

---

Nitpick comments:
In `@internal/raftengine/etcd/wal_store_skip_gate_test.go`:
- Around line 85-95: Rename the parameters in the recordingObs test double so
they reflect the interface contract: change the first parameter name from
snapIndex to target (or lastWalIndex) in both RestoreSkipped and RestoreExecuted
to match how reportColdStart in wal_store.go calls them; update any internal
calculations or variable names in RestoreSkipped and RestoreExecuted to use
target and snapIndex appropriately (e.g., compute gaps as have - target or
target - have only where semantically correct) so the test double’s signature
and names align with the real invocation and avoid confusion.
- Around line 117-119: Extract the repeated header length literal by adding a
package-level constant (e.g., testV1HeaderLen = 16) near the top of
internal/raftengine/etcd/wal_store_skip_gate_test.go and replace all literal
uses (for example the make([]byte, 16) in the payload construction, any
copy/slice ops assuming 16 bytes, and the test double referenced around the
earlier helper) with that constant; update references such as the payload
creation (payload := make([]byte, 16)), any index/slice bounds that assume the
16-byte header, and the test double setup to use testV1HeaderLen so the header
size is maintained in one place.
- Around line 293-315: Add a short clarifying comment immediately above the
intentional bit-flip (the line with b[0] ^= 0x01) in
TestApplyHeaderStateOnSkip_BodyCorruption to state that the mutation is
deliberate test corruption to ensure applyHeaderStateOnSkip detects CRC
mismatch; reference the variables crc, path and the expectation
ErrFSMSnapshotFileCRC and note that FSM state must not change
(fsm.restoredHeader) so reviewers understand the intent.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ddfbe851-5668-47b2-a0e9-bfcd9c549572

📥 Commits

Reviewing files that changed from the base of the PR and between d23cf69 and 57467ed.

📒 Files selected for processing (16)

internal/raftengine/cold_start.go
internal/raftengine/etcd/engine.go
internal/raftengine/etcd/factory.go
internal/raftengine/etcd/grpc_transport_test.go
internal/raftengine/etcd/wal_store.go
internal/raftengine/etcd/wal_store_skip_gate_test.go
internal/raftengine/etcd/wal_store_test.go
internal/raftengine/statemachine.go
kv/fsm.go
kv/fsm_applied_index_iface_check.go
main.go
main_bootstrap_e2e_test.go
main_bootstrap_test.go
monitoring/cold_start.go
monitoring/registry.go
multiraft_runtime_test.go

…bit Major #934) Four findings on review 4434778631 + the coderabbit review on the same head. Round-2 fix for the codex P1 plus the three Major / P2 items collected: 1. Codex P1: skip path didn't advance Raft's applied floor. Even with the lastWalIndex gate (which guarantees the FSM has applied through the WAL tail), Engine.Open still seeded e.applied from maxAppliedIndex(LocalSnap) = snapshot.Metadata.Index. Raft then delivered entries snap.Index+1..have to applyCommitted and replayed them onto a Pebble store already containing them - OCC conflicts in the kvFSM commit path; HLC ceiling inversion. Fix: restoreSnapshotState now returns (effectiveApplied uint64, error). diskState carries EffectiveApplied. Engine.Open seeds e.applied and engine.appliedIndex via coldStartApplied(disk) = max(maxAppliedIndex(LocalSnap), EffectiveApplied). Belt+braces: applyNormalCommitted skips entries with Index <= e.applied (idempotent no-op); setApplied is now advance-only so conf changes between snap.Index+1 and have do not regress the counter. 2. Codex P2 / coderabbit Major (observer args): the ColdStartObserver contract is (snapshotIndex, haveAppliedIndex) - monitoring/cold_start.go computes have - snapIndex for the gauge. Round-1 passed target=lastWalIndex instead, breaking the metric baseline. Revert to passing snapIndex. The 'target' / lastWalIndex is preserved as a structured-log field for operator context. 3. Coderabbit Major (emit-after-success): reportColdStart used to fire BEFORE applyHeaderStateOnSkip / openAndRestoreFSMSnapshot, so a CRC/header failure would still register a 'successful' outcome. Moved to after the restore-side work succeeds. Caller audit (signature + semantic): - restoreSnapshotState: now returns (uint64, error). Single production caller (loadWalState) plumbs into diskState.EffectiveApplied. 14 test call sites updated. - setApplied: advance-only. Callers are all in the apply loop (applyNormalCommitted, applyConfChangeCommitted, applyConfChangeV2Committed, applyCommitted's default case). The new gate prevents regression from conf changes between snap.Index+1 and have, which is exactly the case the codex P1 fix relies on. - coldStartApplied: new helper, called from Engine.Open only. Extracted restoreSnapshotStateFromToken to keep restoreSnapshotState under the nestif limit. New regression tests in wal_store_skip_gate_test.go: - TestSkipGate_ReturnsEffectiveAppliedOnSkip: skip path returns have, not snap.Metadata.Index. - TestSkipGate_EmitsAfterSuccess: CRC mismatch does NOT register obs.skipped / obs.executed (metrics emit only on success). go test + golangci-lint clean on ./internal/raftengine/etcd/.

github-actions · 2026-06-07T09:03:33Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (c9b7c3b):

kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

bootjp · 2026-06-07T09:03:34Z

@claude review

Addresses all 4 findings on review 4434778631 + the coderabbit review on the same head:

Codex P1 (wal_store.go:293) — advance Raft's applied floor on skip. Round-1 fix gated skip on have >= lastWalIndex but Engine.Open still seeded e.applied from maxAppliedIndex(LocalSnap) = snap.Index. Raft would still deliver entries snap.Index+1..have to applyCommitted and replay them onto a Pebble store already containing them.

Fix: restoreSnapshotState now returns (effectiveApplied uint64, error). diskState carries EffectiveApplied. Engine.Open uses coldStartApplied(disk) = max(maxAppliedIndex(LocalSnap), EffectiveApplied) to seed e.applied and the atomic mirror. Belt+braces: applyNormalCommitted skips entries with Index <= e.applied (idempotent no-op); setApplied is now advance-only so conf changes in [snap.Index+1, have] don't regress the counter.
Codex P2 / coderabbit Major (observer args). ColdStartObserver contract is (snapshotIndex, haveAppliedIndex) — Prometheus computes have - snapIndex. Round-1 passed target=lastWalIndex instead. Reverted; lastWalIndex stays as a structured-log field.
Coderabbit Major (emit-after-success). reportColdStart moved to after applyHeaderStateOnSkip / openAndRestoreFSMSnapshot succeeds — CRC/header failures no longer register a 'successful' outcome.

Caller audit: restoreSnapshotState signature change (14 test sites updated). setApplied advance-only audited against all 4 apply paths. coldStartApplied new helper, single caller.

Regression: TestSkipGate_ReturnsEffectiveAppliedOnSkip (skip returns have, not snap.Index) + TestSkipGate_EmitsAfterSuccess (CRC failure does not record skipped/executed metrics).

go test + golangci-lint clean on ./internal/raftengine/etcd/.

chatgpt-codex-connector · 2026-06-07T09:03:38Z

To use Codex here, create a Codex account and connect to github.

…e (codex P2 #934 round 3) Two findings on review 4434778631's c9b7c3b head: 1. Codex P2 (wal_store.go:160) — gate on the committed index. The round-2 fix used entries[n-1].Index (the WAL tail) as the skip threshold, but a follower's WAL can carry an UNCOMMITTED suffix that raft will not surface via Ready.CommittedEntries until the leader confirms it. If the FSM is applied through the committed index but trails that suffix, the gate forced a multi-GiB restore on every restart of any such follower, defeating the cold-start optimization. Fix: extract coldStartSkipThreshold(snapshot, hardState) = max(snapshot.Metadata.Index, hardState.Commit). Drops the walTail term — raft's invariant guarantees Commit ≤ walTail, so the walTail comparison was redundant for safety and harmful for the follower-with-uncommitted-suffix optimization. 2. Codex P2 (monitoring/cold_start.go:80) — avoid underflowing the executed gauge. After the round-1 gate change, an executed restore no longer implies have < snapIndex: a crashed follower can have FSM ahead of the snapshot pointer but behind the committed tail. The previous snapIndex - have subtraction underflowed into ~2^64, corrupting the gauge. Fix: ColdStartObserver.RestoreExecuted computes the absolute distance |snapIndex - have|. Comment updated to explain the new semantics (executed no longer implies stale FSM in the literal sense; it means full restore ran because gate did not skip). Caller audit: - coldStartSkipThreshold: pure helper, called once from loadWalState. No production-side downstream calls. - ColdStartObserver.RestoreExecuted: signature unchanged; behavior change is just the absolute-value computation. Callers (engine cold-start path) pass the same arguments. Regression tests: - TestColdStartSkipThreshold (wal_store_skip_gate_test.go): 4-case table — empty WAL, hardState below snapshot (pre-replay), hardState above snapshot, hardState zero. - TestColdStartObserver_ExecutedAbsoluteGap (new monitoring/cold_start_test.go): 3-case table — FSM behind / FSM ahead / FSM equal. - TestColdStartObserver_SkippedGap: skipped-path baseline. - TestColdStartObserver_NilSafe: nil observer is a no-op. - TestColdStartObserver_FallbackReasonLabel: 3-reason table covering not_reader / missing_meta / read_err. go test + golangci-lint clean on ./monitoring/ + ./internal/raftengine/etcd/.

github-actions · 2026-06-07T09:14:40Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (1b166a5):

kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

bootjp · 2026-06-07T09:14:42Z

@claude review

Addresses codex P2 #934 round 3 (review 4434778631 → 2 new inline findings):

wal_store.go:160 — gate on the committed index. Round-2 used entries[n-1].Index (WAL tail) as the skip threshold. A follower's WAL can carry an UNCOMMITTED suffix that raft will not deliver via Ready.CommittedEntries until the leader confirms it. Followers with such suffixes were forced into a full restore on every restart.

Fix: new coldStartSkipThreshold(snapshot, hardState) = max(snapshot.Metadata.Index, hardState.Commit). Raft's invariant guarantees Commit ≤ walTail, so the walTail comparison was redundant for safety and harmful for the follower optimization.
monitoring/cold_start.go:80 — avoid underflowing the executed gauge. Round-1's gate change means executed restores no longer imply have < snapIndex (crashed follower can have FSM ahead of snapshot but behind committed tail). The snapIndex - have subtraction underflowed into ~2^64.

Fix: RestoreExecuted computes |snapIndex - have| (absolute distance). Comment updated.

Regression tests:

TestColdStartSkipThreshold — 4-case table for helper.
TestColdStartObserver_ExecutedAbsoluteGap — 3-case table covering FSM behind / FSM ahead / FSM equal.
Plus TestColdStartObserver_SkippedGap, _NilSafe, _FallbackReasonLabel for monitoring coverage.

chatgpt-codex-connector · 2026-06-07T09:14:48Z

To use Codex here, create a Codex account and connect to github.

claude · 2026-06-07T09:14:53Z

Claude finished @bootjp's task in 3m 2s —— View job

PR #934 Round-5 Review — Committed-index gate + executed-gauge underflow fix (commit `1b166a5`)

Todo list:

Overall assessment

Both fixes are correct. The coldStartSkipThreshold helper picks the right comparison basis (max(snapshot.Metadata.Index, hardState.Commit)) and the RestoreExecuted absolute-gap fix closes a genuine uint64 underflow. Three findings below.

Finding 1 — `recordingObs.RestoreExecuted` uses `snapIndex - have` (no abs), while the production impl now uses `|snapIndex - have|` — test assertions are technically undefined behaviour

Location: internal/raftengine/etcd/wal_store_skip_gate_test.go:91-95, 165

// test double (line 91-95)
func (o *recordingObs) RestoreExecuted(snapIndex, have uint64) {
    o.executed = append(o.executed, snapIndex-have)  // can underflow
}

// assertion (line 165): snapIndex=100, appliedIdx=150
require.Equal(t, []uint64{snapIndex - appliedIdx}, obs.executed)
// both sides compute 100-150 = 18446744073709551466 (wraps equally → test passes)

The test passes only because both sides compute the same wrapped value, not because the semantically correct gap of 50 was verified. The monitoring.ColdStartObserver.RestoreExecuted was fixed to use |snapIndex - have|, but recordingObs was not updated. A future test adding assert.Equal(t, uint64(50), obs.executed[0]) for the FSM-ahead-of-snapshot scenario would fail unexpectedly.

Fix this →

Finding 2 — `gap_ahead` log field uses `have - target` (committed tail) while the metric uses `have - snapIndex` — operators see different baselines

Location: internal/raftengine/etcd/wal_store.go:444

case coldStartSkip:
    if obs != nil {
        obs.RestoreSkipped(snapIndex, have)  // metric gap = have - snapIndex
    }
    if logger != nil {
        logger.Info("restoreSnapshotState skipped",
            zap.Uint64("gap_ahead", have-target),  // log gap = have - hardState.Commit

When hardState.Commit > snapshot.Metadata.Index (post-replay nodes), these two values differ. An operator correlating the log against the Prometheus metric will see different magnitudes. The metric is per-design (snapshot-relative gap); the log would be more consistent if it also logged have-snapIndex as gap_ahead and kept target as a separate last_committed_index field.

This is a minor observability concern, not a correctness issue. Consider either aligning the log to have-snapIndex, or explicitly naming the fields to clarify both:

zap.Uint64("gap_ahead_snapshot",   have-snapIndex),   // what the metric measures
zap.Uint64("gap_ahead_committed",  have-target),      // how far past the WAL committed tail

Finding 3 — `coldStartSkipThreshold` `Commit < snapIndex` case: defensive but realistic in one crash window (no action required)

Location: internal/raftengine/etcd/wal_store.go:211-216 + test case "hardState.Commit below snapshot (pre-replay)"

The test covers hardState.Commit = 50, snapshot.Metadata.Index = 100 → threshold = 100. In practice raft guarantees Commit >= snapshot.Index for any persisted snapshot, but the guard still makes the function safe against a crash window where the snapshot was flushed before the HardState update reached fsync. Returning snapIndex is strictly safer than returning Commit = 50 (would incorrectly allow skip even though raft hasn't committed those entries). Correct as written. ✓

`setApplied` advance-only + conf-change re-delivery — correct ✓

applyNormalCommitted has the entry.Index <= e.applied early-return guard, preventing re-delivery of [snap.Index+1, have] normal entries. Conf changes in the same range go through applyConfChangeCommitted without the guard — re-running ApplyConfChange / persistConfigState / applyConfigChange is idempotent (membership updates, Pebble config-state write). setApplied advance-only guard prevents the counter from regressing when the re-delivered conf-change entry calls setApplied(old_index). Design-correct.

TLA+ spec — no divergence ✓

kv/fsm.go anchor: changes in this commit are confined to coldStartSkipThreshold in wal_store.go, RestoreExecuted in monitoring/cold_start.go, and test files. No HLC issuance path, no OCC commit-ts logic, no MVCC visibility, no route-catalog wiring touched. HLC-4 preconditions (i)(ii)(iii) unaffected. No spec update required.

Round summary

Round	Finding	Severity	Status
R1-F1	Header-error-before-CRC order doc	Minor	Open (doc-only)
R1-F2	`have == want` boundary test	Minor	Superseded by `TestColdStartSkipThreshold` + committed-index tests
R3-F1	Observer docstrings stale (`snapIndex` → WAL tail)	Minor	Closed — args reverted to `snapIndex` ✓
R3-F2	Skip path with non-empty WAL untested	Minor	Closed — `TestColdStartSkipThreshold` + restructured gate covers it ✓
R5-F1	`recordingObs.RestoreExecuted` underflow, hides correct abs-gap assertion	Minor	Open
R5-F2	`gap_ahead` log uses committed-tail baseline, metric uses snapshot baseline	Minor	Open
Codex P2 #1	Gate on `hardState.Commit` not WAL tail	—	✓ Correct
Codex P2 #2	`RestoreExecuted` underflow in gauge	—	✓ Correct
TLA+ spec	No divergence	—	✓ None

The two remaining open items are minor quality issues; neither affects production correctness or the soak-plan metrics.

… (claude #934 round 5) Two findings on the 1b166a5 head: 1. recordingObs.RestoreExecuted (test double) used snapIndex - have, which underflows when have > snapIndex - the exact case the round-3 production fix handled. Tests passed only because both sides wrapped to the same value. Update the test double to mirror monitoring.ColdStartObserver's |snapIndex - have| and tighten TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries from a length-only assertion to the absolute gap (50). 2. The 'gap_ahead' / 'gap_behind' log fields used target (committed tail) as the baseline while the Prometheus gauges use snapIndex. Operators correlating the structured log against the metric saw different magnitudes whenever hardState.Commit > snap.Index (the post-replay normal case). Rename the log fields to be explicit: - last_committed_index (was last_wal_index; the helper now returns max(snap.Index, hardState.Commit), so the new name is more accurate) - gap_ahead_snapshot (skipped path; mirrors metric) - gap_ahead_committed (skipped path; new committed-tail relative) - gap_to_snapshot (executed path; absolute value, no underflow) - gap_behind_committed (executed path; matches gap_behind name semantics) Extracted the executed arm into reportColdStartExecute so reportColdStart stays under the cyclop limit (the abs-value computation pushes the inline branch over). Caller audit: reportColdStartExecute is private to wal_store.go, single call site from reportColdStart. recordingObs.RestoreExecuted is the test double; assertion update was straightforward (1 line). Tests + lint clean on ./internal/raftengine/etcd/ + ./monitoring/.

bootjp · 2026-06-07T09:23:17Z

@claude review

Addresses both round-5 findings:

Finding 1 — recordingObs test double drift. recordingObs.RestoreExecuted updated to mirror monitoring.ColdStartObserver's absolute-value computation (|snapIndex - have|). Tightened TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries to assert the semantically correct gap (50) instead of a length-only check.
Finding 2 — log/metric baseline drift. Renamed log fields to disambiguate the two baselines:
- Skipped: gap_ahead_snapshot (= metric baseline) + gap_ahead_committed (= committed-tail relative).
- Executed: gap_to_snapshot (absolute, no underflow) + gap_behind_committed.
- last_wal_index → last_committed_index since the helper now returns max(snap.Index, hardState.Commit).
Extracted the executed arm into reportColdStartExecute to keep reportColdStart under cyclop.

Finding 3 (defensive Commit < snapIndex guard) — confirmed correct as-is by claude; no change.

go test + golangci-lint clean on ./internal/raftengine/etcd/ + ./monitoring/.

github-actions · 2026-06-07T09:23:20Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (d9bf745):

kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

chatgpt-codex-connector · 2026-06-07T09:23:27Z

To use Codex here, create a Codex account and connect to github.

claude · 2026-06-07T09:23:31Z

Claude finished @bootjp's task in 1m 55s —— View job

PR #934 Round-6 Review — recordingObs fix + log/metric baseline disambiguation (commit `d9bf745`)

Todo list:

Load tools
Read wal_store.go — renamed log fields + reportColdStartExecute extraction
Read wal_store_skip_gate_test.go — recordingObs.RestoreExecuted abs-value fix + tightened gap assertion
Read monitoring/cold_start.go + monitoring/cold_start_test.go
Verify Finding 1 (abs-value test double) is correctly fixed
Verify Finding 2 (log/metric baseline disambiguation) is correctly done
Check cyclop compliance of extracted reportColdStartExecute
TLA+ spec: kv/fsm.go anchor — no new divergence
Post final review

Overall assessment

Both round-5 findings are correctly and completely closed. No new issues introduced. The implementation is clean.

R5-F1 — `recordingObs.RestoreExecuted` drift — Closed correctly ✓

The test double now mirrors monitoring.ColdStartObserver exactly:

// wal_store_skip_gate_test.go:91-106
func (o *recordingObs) RestoreExecuted(snapIndex, have uint64) {
    var gap uint64
    if have >= snapIndex {
        gap = have - snapIndex
    } else {
        gap = snapIndex - have
    }
    o.executed = append(o.executed, gap)
}

The tightened assertion in TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries (appliedIdx=150, snapIndex=100):

require.Equal(t, []uint64{appliedIdx - snapIndex}, obs.executed,
    "observer MUST record the absolute snapshot-relative gap")

Resolves to []uint64{50} — the semantically correct absolute gap, not []uint64{18446744073709551566} (underflow). The existing TestSkipGate_ExecutesWhenFSMStale (snapIndex=200, appliedIdx=100) continues to assert []uint64{snapIndex - appliedIdx} = []uint64{100}, which still equals the absolute gap since snapIndex > have in that case. ✓

R5-F2 — Log/metric baseline drift — Closed correctly ✓

The two-field disambiguation in reportColdStart (skip arm, wal_store.go:477-483):

zap.Uint64("gap_ahead_snapshot", have-snapIndex),   // metric baseline: have >= snapIndex guaranteed by gate
zap.Uint64("gap_ahead_committed", have-target),     // target = max(snap.Index, hardState.Commit)

And in the extracted reportColdStartExecute (wal_store.go:214-219):

zap.Uint64("gap_to_snapshot", gapToSnapshot),        // |snapIndex - have|, cannot underflow
zap.Uint64("gap_behind_committed", target-have),     // target > have guaranteed by gate condition
zap.Uint64("last_committed_index", target),

Both the comment ("// Two named gap fields so an operator correlating the log against the Prometheus gauge sees consistent magnitudes") and the field naming give operators an unambiguous correlation path to monitoring.ColdStartObserver. ✓

Underflow safety of log fields verified:

have - snapIndex (skip): skip gate requires have >= target >= snapIndex → no underflow. ✓
have - target (skip): skip gate requires have >= target → no underflow. ✓
target - have (execute): execute arm requires have < target → no underflow. ✓
gapToSnapshot (execute): explicit abs-value conditional → safe for all orderings. ✓

`reportColdStartExecute` extraction — Cyclop justified ✓

The extraction is the minimal change to keep reportColdStart under the cyclop limit. The abs-value conditional in the execute arm (if have >= snapIndex) adds one branch that would push the parent over. The split is clean: reportColdStartExecute has a single responsibility (log the execute arm) and no callers other than reportColdStart. ✓

TLA+ spec — No divergence ✓

kv/fsm.go anchor: this commit touches only wal_store.go (log-field renames + reportColdStartExecute extraction) and wal_store_skip_gate_test.go/monitoring/cold_start_test.go. No HLC issuance path, OCC commit-ts, MVCC visibility, or routing-catalog logic is touched. HLC-4 preconditions (i)(ii)(iii) unaffected. No spec update required.

One minor note (no action required)

TestSkipGate_SkipsWhenFSMFreshEnough and TestSkipGate_ReturnsEffectiveAppliedOnSkip both call restoreSnapshotState with lastWalIndex = snap.Metadata.Index (empty WAL tail). The skip path with non-empty WAL (have = lastWalIndex > snap.Metadata.Index) is still not tested directly — this was R3-F2 (open, nice-to-have) and remains so. Not a blocker; the gate logic is well-covered by TestColdStartSkipThreshold and TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries, and the anti-double-apply mechanism (setApplied advance-only + applyNormalCommitted skip) is unit-tested independently.

Round summary

Finding	Severity	Status
R1-F1 — header-error-before-CRC ordering (doc-only)	Minor	Open (doc comment; no safety impact)
R3-F2 — skip path with non-empty WAL untested	Minor	Open (nice-to-have)
R5-F1 — `recordingObs.RestoreExecuted` underflow, masks correct abs-gap assertion	Minor	✅ Closed (`d9bf745`)
R5-F2 — `gap_ahead` log uses committed-tail baseline, metric uses snapshot baseline	Minor	✅ Closed (`d9bf745`)
TLA+ spec divergence	n/a	✅ None

The implementation is correct and complete. All major and minor actionable findings across six review rounds are resolved. The two remaining open items are documentation-only (R1-F1) and test-coverage quality (R3-F2); neither affects production correctness, safety, or the soak-plan observability signals. Ready to merge.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9bf745f96

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T09:25:50Z

+	if entry.Index <= e.applied {
+		return nil


Preserve post-snapshot HLC lease effects when skipping replay

When the committed tail contains an HLC lease entry after the snapshot followed by a KV mutation that advances metaAppliedIndex, the skip gate seeds e.applied at the tail and this guard drops both entries. ApplySnapshotHeader restores only the older snapshot-time ceiling, so the post-snapshot lease's in-memory SetPhysicalCeiling effect is never reconstructed. If that lease ceiling is still ahead of wall time, the restarted node can retain persisted timestamps above its HLC ceiling and reject fenced timestamp issuance or later issue timestamps from a stale ceiling; duplicate data mutations must be suppressed without suppressing volatile-only entries such as HLC leases.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T09:25:50Z

+		appliedIndexGap: prometheus.NewGaugeVec(prometheus.GaugeOpts{
+			Name: "elastickv_fsm_cold_start_applied_index_gap",
+			Help: "Absolute distance between the snapshot pointer and the FSM's reported applied index at cold start, in Raft log entries. outcome=skipped reports how far ahead the live store was; outcome=executed reports how far behind it was.",
+		}, []string{"outcome"}),


Partition cold-start metrics by Raft group

In multi-group mode, buildShardGroups opens several engines through one factory carrying the same registry observer, but this gauge is labeled only by outcome. Each shard therefore writes the same time series during startup, and the final value is merely whichever group's callback ran last rather than the gap for an identifiable FSM; the counters likewise aggregate multiple gate decisions despite the documentation describing at most one event per process. Include the Raft group in the observer or metric labels so multi-shard alerts and diagnostics remain meaningful.

Useful? React with 👍 / 👎.

… P1 #934 round 7) When the cold-start skip gate fired the engine seeded `e.applied` at the WAL committed tail and the duplicate guard (`entry.Index <= e.applied { return nil }`) then dropped every replayed entry — including HLC lease entries (raftEncodeHLCLease, tag 0x02). HLC leases have purely in-memory effects (`HLC.SetPhysicalCeiling`), so their post-snapshot ceiling raise was lost on every cold start. `ApplySnapshotHeader` then restored only the older snapshot-time ceiling and the next leader-issued fenced timestamp could collide with persisted commit_ts values stamped under the lost lease. Fix: - Add raftengine.VolatileEntryClassifier — an optional FSM seam that classifies cleartext FSM payloads as volatile-only or data-mutating. - kvFSM.IsVolatileOnlyPayload returns true iff payload[0] == raftEncodeHLCLease. Static iface assertion in fsm_applied_index_iface_check.go. - applyNormalEntry takes a new dropIfNonVolatile flag. On the cold-start duplicate path the engine still decodes the envelope and decrypts (post-cutover HLC leases are encrypted) so the cleartext tag is visible, then either: * passes the payload to fsm.Apply (volatile — replayed for its in-memory monotonic effect, NO setApplied, NO resolveProposal); or * returns (nil, nil) (data-mutating — dropped, preserving the OCC / sidecar idempotency from PR #934 round 1). - applyNormalCommitted forwards `duplicate` to applyNormalEntry and only advances setApplied + resolveProposal on the fresh path. Test: - cold_start_volatile_replay_test.go pins three contracts: (1) duplicate HLC lease → fsm.Apply called, e.applied unchanged; (2) duplicate KV/MVCC → fsm.Apply NOT called, e.applied unchanged; (3) FSM that does not implement the classifier → drop-all (no opt-in regression). - Also pins the fresh-entry asymmetry: every tag reaches fsm.Apply when entry.Index > e.applied. Self-review: 1. Data loss: HLC lease re-apply is monotonic (SetPhysicalCeiling) — no regression, no double-write. The guard for data-mutating entries is unchanged. 2. Concurrency: classifier called inside the existing applyNormalEntry path on the same goroutine — no new locks. 3. Performance: cold-start replay only; one extra envelope decode + (post-cutover) decryption per duplicate. Bounded by WAL tail length; not on a steady-state hot path. 4. Consistency: classifier sees the cleartext payload (post- decrypt), so post-cutover HLC leases are still recognised. Encryption opcodes (0x03..0x07) correctly classified as data- mutating — re-apply would diverge the sidecar RaftAppliedIndex. 5. Test coverage: three table-driven cases pinning every arm (volatile-replay, data-drop, classifier-absent), plus the fresh-entry asymmetry table.

github-actions · 2026-06-07T10:06:22Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (20f90a9):

kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

bootjp · 2026-06-07T10:06:31Z

@claude review

Round 7 addresses codex P1: post-snapshot HLC lease effects were lost when the cold-start skip gate fired.

Root cause. The Round-1 idempotency guard
(if entry.Index <= e.applied { return nil }) treated every duplicate uniformly,
but HLC lease entries (raftEncodeHLCLease, tag 0x02) carry purely in-memory
effects (HLC.SetPhysicalCeiling). After the skip gate seeded e.applied at the
WAL committed tail, the guard dropped all post-snapshot lease entries.
ApplySnapshotHeader restored only the older snapshot-time ceiling, so the next
leader-issued fenced timestamp could collide with persisted commit_ts values
stamped under the lost lease.

Fix.

New optional FSM seam raftengine.VolatileEntryClassifier —
classifies cleartext FSM payloads as volatile-only or data-mutating.
kvFSM.IsVolatileOnlyPayload returns true iff payload[0] == raftEncodeHLCLease.
Static iface assertion pins the contract.
applyNormalEntry now takes a dropIfNonVolatile flag. On the cold-start
duplicate path the engine still decodes the envelope and decrypts
(post-cutover HLC leases are encrypted), then either passes the payload to
fsm.Apply (volatile — monotonic in-memory effect, no setApplied, no
resolveProposal) or returns (nil, nil) (data-mutating — dropped, preserving
the Round-1 OCC / sidecar idempotency).
applyNormalCommitted forwards duplicate to applyNormalEntry and only
advances setApplied + resolveProposal on the fresh path.

Test. cold_start_volatile_replay_test.go pins three contracts:

duplicate HLC lease → fsm.Apply called, e.applied unchanged
duplicate KV/MVCC → fsm.Apply NOT called, e.applied unchanged (Round-1 regression locked down)
FSM that doesn't implement the classifier → drop-all (no opt-in regression)

Plus the fresh-entry asymmetry: every tag still reaches fsm.Apply when
entry.Index > e.applied.

Caller audit. applyNormalEntry signature widened; every existing call site
(applyNormalCommitted + 10 encryption_test.go sites) passes false,
preserving the pre-PR semantics.

Open items not addressed in this commit.

codex P2 (factory plumbing for ColdStartObserver) — separate concern
scoped at Factory.Create / OpenConfig; will follow up if you want it
bundled or in a follow-up PR.
codex P2 (per-group metric labels) — same call; meaningful only after the
factory plumbing lands so the registry observer is actually fed.

CI note: previous run hit a 504 on reviewdog install — transient infra, not a
lint failure on the code.

claude · 2026-06-07T10:06:42Z

Claude finished @bootjp's task in 3m 42s —— View job

PR #934 Round-7 Review — Volatile HLC lease replay on cold-start skip (commit `20f90a9`)

Todo list:

Overall assessment

The fix is correct. The root cause analysis is accurate, the seam is well-designed, and the three contracts tested are the right ones. One finding and one minor note below.

Finding 1 — `SetApplyIndex` is called with a stale index for volatile replays (Minor / informational)

Location: internal/raftengine/etcd/engine.go:2407

When applyNormalEntry takes the volatile-replay path (dropIfNonVolatile=true, IsVolatileOnlyPayload=true), it calls SetApplyIndex(entry.Index) before fsm.Apply(payload):

// In applyNormalEntry (line 2407)
if aware, ok := e.fsm.(raftengine.ApplyIndexAware); ok {
    aware.SetApplyIndex(entry.Index)   // ← entry.Index is BELOW e.applied on the duplicate path
}
return e.fsm.Apply(payload), nil

For a duplicate entry at index 150 with e.applied = 200, this sets kvFSM.pendingApplyIdx = 150. Since applyHLCLease doesn't read pendingApplyIdx, this has no observable effect on the HLC ceiling raise — the replay works correctly.

However, it is a semantic inconsistency: pendingApplyIdx is documented as "the Raft entry index the engine is about to apply" (kv/fsm.go:42) and it is consumed by applyEncryption (encryption sidecar RaftAppliedIndex) and ApplyMutationsRaftAt (sidecar sync). Since IsVolatileOnlyPayload must return false for encryption opcodes (enforced by the comment in kv/fsm.go:641-644), those consumers will never run on the volatile path. The risk is zero for the current implementation — but a future volatile entry type that also happens to use pendingApplyIdx for a durability side-effect would silently write a stale index.

The safest fix is to skip SetApplyIndex on the duplicate path entirely. Since volatile Apply (HLC lease) never reads pendingApplyIdx, this produces no behaviour change:

// Only SetApplyIndex on the non-duplicate path
if !dropIfNonVolatile {
    if aware, ok := e.fsm.(raftengine.ApplyIndexAware); ok {
        aware.SetApplyIndex(entry.Index)
    }
}
return e.fsm.Apply(payload), nil

Not a correctness bug today. Worth documenting at minimum.

Consider fixing →

Finding 2 — No test for the encrypted-volatile path (Minor)

Location: internal/raftengine/etcd/cold_start_volatile_replay_test.go

All three tests use newTestEngine(fsm, nil, nil) — no cipher, no cutover, so the decryption branch in applyNormalEntry is never exercised. The production scenario that matters most is:

e.raftCutoverIndex() is active (post-Stage-8a cluster)
A duplicate HLC lease at index 150 is encrypted (wrapped by WrapRaftPayload)
The engine must: decrypt → classify as volatile → replay

The correct ordering is locked in the code (decryption runs before IsVolatileOnlyPayload), but is not tested. A test using envelopeEntry from encryption_test.go with a volatileTagFakeFSM and a real cipher/cutover would close this gap:

// pseudo-sketch
c, kid := raftCipherFixture(t)
cutover := func() uint64 { return 100 }
fsm := &volatileTagFakeFSM{}
e := newTestEngine(fsm, c, cutover)
e.applied = 200

plainPayload := []byte{0x02, ...}   // HLC lease tag
entry := envelopeEntry(t, c, kid, 150, plainPayload)  // encrypted at index 150 (below e.applied=200)

err := e.applyNormalCommitted(entry)
// verify fsm.calls = 1, e.applied = 200

Not a blocker, but the current tests only cover the non-encrypted (pre-cutover) path.

Correctness of the fix — verified ✓

Property	Verification
HLC ceiling monotonicity: replay is idempotent	`applyHLCLease` → `SetPhysicalCeiling` is a monotonic raise; replaying raises ceiling to at least the post-snapshot level ✓
OCC safety: data-mutating duplicates not replayed	`IsVolatileOnlyPayload = false` → `applyNormalEntry` returns `(nil, nil)` before `SetApplyIndex`/`Apply` ✓
`setApplied` not advanced for any duplicate	`applyNormalCommitted:2260` — both volatile and non-volatile duplicates hit `if duplicate { return nil }` before `setApplied` ✓
`resolveProposal` not called for volatile replay	Same arm; no coordinator waiting for HLC lease proposals ✓
`HaltApply` guard runs before duplicate arm	`applyNormalCommitted:2252-2259` — checked on both fresh and duplicate path; HLC lease Apply returns nil so no halt risk ✓
FSMs without classifier get drop-all (no regression)	`!ok → return nil, nil` in the type-assertion guard; `TestApplyNormalCommitted_DuplicateWithoutClassifier_DropsAll` pins it ✓
Post-cutover (encrypted) HLC leases: decryption before classification	`applyNormalEntry` decrypts at line 2370-2391, then classifies at line 2401-2406 ✓
Compile-time guard against interface drift	`var _ raftengine.VolatileEntryClassifier = (*kvFSM)(nil)` in `kv/fsm_applied_index_iface_check.go` ✓

TLA+ spec — no divergence ✓

kv/fsm.go anchor: IsVolatileOnlyPayload is a pure classification (no state change — explicitly documented). applyHLCLease / SetPhysicalCeiling path is unchanged from previous rounds. The fix ensures the ceiling is raised to at least the post-snapshot level, which is the HLC-4 precondition (i) requirement (bounded skew: newly elected leader's ceiling ≥ the largest ceiling stamped on any committed entry). HLC-4 preconditions (ii) and (iii) are unaffected. No OCC commit-ts, MVCC visibility, or routing-catalog logic is touched. No spec update required.

Round summary

Round	Finding	Severity	Status
R1-F1	Header-error-before-CRC ordering (doc-only)	Minor	Open (doc comment)
R4–R6	All major findings (apply-floor advance, observer args, emit-after-success, abs-gap, log baseline)	Major/Minor	✅ Closed
R7-F1	`SetApplyIndex` called with stale index on volatile replay	Minor / informational	Open (no current risk)
R7-F2	Encrypted-volatile path not tested	Minor	Open (nice-to-have)
TLA+ spec divergence	No divergence	—	✅ None
Codex P1 #934 round 7	Post-snapshot HLC lease ceiling lost on skip	—	✅ Correct

The codex P1 root cause is correctly diagnosed and the fix is complete. Ready to merge; the two open items are informational and don't affect production correctness or the soak plan.

chatgpt-codex-connector · 2026-06-07T10:08:28Z

To use Codex here, create a Codex account and connect to github.

…pted-volatile path (claude #934 round 7 F1+F2) R7-F1 (Minor / informational): on the cold-start duplicate-replay path entry.Index is BELOW e.applied, so calling SetApplyIndex would feed a stale index to any future durability sink (encryption sidecar RaftAppliedIndex, ApplyMutationsRaftAt's metaAppliedIndex bundle). Today's only volatile entry (HLC lease) does not read pendingApplyIdx so this was a no-op, but it future-proofs the seam against a new volatile entry type that also persists. SetApplyIndex now only fires on the !dropIfNonVolatile arm. R7-F2 (Minor / nice-to-have): the round-7 test fixtures all used newTestEngine(fsm, nil, nil) — no cipher, no cutover — so the decryption-before-classification ordering in applyNormalEntry was locked in code but not in tests. Added TestApplyNormalCommitted_VolatileDuplicate_PostCutoverEncrypted which wraps an HLC lease via WrapRaftPayload at an above-cutover index that's still BELOW e.applied. Verifies the engine decrypts, classifies as volatile, replays for the in-memory effect, and hands fsm.Apply the cleartext payload byte-for-byte. While extracting cyclop budget, factored the dropIfNonVolatile classifier arm into Engine.isVolatilePayload. Same semantics; cleaner call site. Self-review: 1. Data loss: no — replay path unchanged, SetApplyIndex suppression has no observable effect for HLC lease (the only current volatile type) but prevents future regression. 2. Concurrency: no new locks, no new goroutines. Same serialised apply loop. 3. Performance: one extra method call per cold-start duplicate. Bounded by WAL tail length; not on a steady-state path. 4. Consistency: encrypted-volatile path now pinned by test — post-cutover HLC lease decrypt → classify → replay is locked against future drift. 5. Test coverage: added explicit encrypted-volatile test; existing 3 contracts (volatile-replay / data-drop / classifier-absent) and fresh-entry asymmetry still pass.

github-actions · 2026-06-07T10:15:24Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (fc55c7b):

kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

bootjp · 2026-06-07T10:15:30Z

@claude review

Round 7 F1+F2 follow-up — both Minor findings from your last review are now closed.

R7-F1 — SetApplyIndex stale-index on volatile replay. Moved
SetApplyIndex inside an if !dropIfNonVolatile {…} guard so it only
fires on the fresh-entry arm. Today's only volatile entry (HLC lease)
doesn't read pendingApplyIdx so this is observably a no-op, but it
future-proofs the seam against a new volatile entry type that would
also persist (the seam's documented purpose).

R7-F2 — Encrypted-volatile path untested. Added
TestApplyNormalCommitted_VolatileDuplicate_PostCutoverEncrypted
which uses raftCipherFixture + envelopeEntry to wrap an HLC lease
above the cutover index but BELOW e.applied. Verifies:

fsm.Apply reached once (decrypt → classify → replay ordering),
FSM receives the cleartext payload byte-for-byte (classifier must
see post-decrypt bytes),
e.applied pinned at 200 (no advance on duplicate arm).

Refactor note. Pulled the classifier check into
Engine.isVolatilePayload to stay under the cyclop budget — same
semantics, cleaner call site.

Tests + lint clean locally:

go test -race ./internal/raftengine/etcd/ ./kv/... ok
golangci-lint run ./internal/raftengine/... ./kv/... 0 issues

R1-F1 (header-error-before-CRC ordering doc comment) is the only
remaining open item; you flagged it as doc-only, no safety impact.

chatgpt-codex-connector · 2026-06-07T10:15:32Z

To use Codex here, create a Codex account and connect to github.

claude · 2026-06-07T10:15:43Z

Claude finished @bootjp's task in 2m 35s —— View job

PR #934 Round-8 Review — R7-F1 + R7-F2 follow-up (commit `fc55c7b`)

Todo list:

Load tools
Read engine.go — SetApplyIndex guard + isVolatilePayload refactor
Read cold_start_volatile_replay_test.go — encrypted-volatile test
Verify R7-F1: SetApplyIndex guard correctness
Verify R7-F2: encrypted-volatile test coverage
Review isVolatilePayload extraction
TLA+ spec: kv/fsm.go anchor check
Post final review

Overall assessment

Both R7-F1 and R7-F2 are correctly and completely closed. No new issues. The PR is clean.

R7-F1 — `SetApplyIndex` stale-index on volatile replay — Closed correctly ✓

The guard is correctly placed:

// engine.go:2405-2409
if !dropIfNonVolatile {
    if aware, ok := e.fsm.(raftengine.ApplyIndexAware); ok {
        aware.SetApplyIndex(entry.Index)
    }
}
return e.fsm.Apply(payload), nil

The three code paths through applyNormalEntry after the change:

`dropIfNonVolatile`	`IsVolatileOnlyPayload`	`SetApplyIndex` called?	`fsm.Apply` called?
`false` (fresh entry)	any	✓ yes, at `entry.Index`	✓ yes
`true` (duplicate), data-mutating	`false`	✗ returns `(nil,nil)` early at line 2392	✗
`true` (duplicate), volatile	`true`	✗ guarded by `!dropIfNonVolatile`	✓ yes, with cleartext payload

All three paths are correct. The comment explains the future-proofing rationale clearly. The change is a pure no-op for all currently existing volatile entries (only HLC lease, which reads no pendingApplyIdx) but guards the seam against a future volatile type that also persists.

R7-F2 — Encrypted-volatile path untested — Closed correctly ✓

TestApplyNormalCommitted_VolatileDuplicate_PostCutoverEncrypted correctly exercises the full decrypt → classify → replay ordering:

cipher wired, cutover=100, e.applied=200
plain := []byte{0x02, 0, 0, 0, 0, 0, 0, 0, 0x01}
entry := envelopeEntry(t, c, kid, 150, plain)

Execution trace through applyNormalEntry:

decodeProposalEnvelope → extracts the wrapped (encrypted) payload
entry.Index(150) > cutover(100) → unwrapRaftPayload → payload = plain = [0x02, ...]
dropIfNonVolatile(true) && !isVolatilePayload([0x02,...]) → isVolatilePayload = true → does NOT return early
!dropIfNonVolatile = false → SetApplyIndex suppressed
fsm.Apply([0x02,...]) with cleartext

Three assertions:

fsm.calls = 1 — decryption ran before classification, lease was replayed ✓
fsm.lastPayload == plain — classifier received post-decrypt bytes ✓
e.applied == 200 — not advanced on duplicate arm ✓

This is precisely the scenario R7-F2 flagged. The test is correctly structured and locks in the ordering that protects against future regressions.

`isVolatilePayload` extraction — Clean ✓

The extraction is a pure refactor:

Before: inline type assertion + ok check in applyNormalEntry (would push past cyclop budget with the additional if !dropIfNonVolatile branch)
After: extracted to Engine.isVolatilePayload, identical semantics

The method comment accurately documents the "runs AFTER decryption" contract — critical for anyone reading the method in isolation. The !ok → false default is correctly stated as "preserving pre-PR semantics for FSMs that haven't opted in." ✓

TLA+ spec — No divergence ✓

This commit touches only engine.go and cold_start_volatile_replay_test.go. The kv/fsm.go anchor file (which triggers the TLA+ check on every push) is unchanged from the previous commits. As confirmed in Round-7:

applyHLCLease / SetPhysicalCeiling path: unchanged, HLC-4 preconditions (i)/(ii)/(iii) unaffected
ApplySnapshotHeader: unchanged
No OCC commit-ts, MVCC visibility, or routing-catalog changes

No spec update required.

Complete finding tracker (all rounds)

Round	Finding	Severity	Status
R1-F1	Header-error-before-CRC ordering (doc comment)	Minor	Open (doc-only, no safety impact)
R1-F2 → R3-F2	Skip path coverage completeness	Minor	✅ Superseded by committed-index gate restructure
R3-F1 → R6	Observer docstrings / log field alignment	Minor	✅ Closed
R4	Apply-floor advance on skip (Codex P1)	Critical	✅ Closed
R4	Observer args (Codex P2 / CodeRabbit Major)	Major	✅ Closed
R4	Emit-after-success (CodeRabbit Major)	Major	✅ Closed
R5	Gate on `hardState.Commit` not WAL tail	Major	✅ Closed
R5	`RestoreExecuted` underflow in gauge	Major	✅ Closed
R6	Test double abs-gap drift	Minor	✅ Closed
R7	Post-snapshot HLC lease ceiling lost on skip (Codex P1)	Critical	✅ Closed
R7-F1	`SetApplyIndex` with stale index on volatile replay	Minor	✅ Closed (`fc55c7b`)
R7-F2	Encrypted-volatile path untested	Minor	✅ Closed (`fc55c7b`)
TLA+ spec	No divergence across all rounds	—	✅ None

The sole remaining open item is R1-F1 — a doc comment explaining the header-parse-before-body-CRC ordering in applyHeaderStateOnSkip. You confirmed this as doc-only with no safety impact. If you'd like to add it before merging, the location is wal_store.go at the applyHeaderStateOnSkip function body. Otherwise the PR is ready to merge.

…tateOnSkip (claude #934 R1-F1) R1-F1 was the only open finding across 8 review rounds, marked doc- only with no safety impact. Add a comment in applyHeaderStateOnSkip spelling out why header parse runs before the body-CRC compare: - Both header-parse errors and CRC errors are equally fatal for the skip path (snapshot file corruption) and both must propagate without ever calling ApplySnapshotHeader. - CRC compare stays AFTER the header parse so the TeeReader has actually been drained when h.Sum32() is read; inverting the order would let a CRC mismatch surface on a truncated body even when the header was valid, muddying the operator-facing diagnostic. - Matches the same ordering openAndRestoreFSMSnapshot uses, so fallback paths see consistent error chains regardless of which routine ran. No code change; this is the only safe artifact of an audit comment. Tests pass; lint clean.

bootjp · 2026-06-07T10:22:59Z

@claude review

Final cleanup: R1-F1 doc comment in applyHeaderStateOnSkip explaining the header-error-before-CRC ordering. No code change. Every finding across 8 rounds is now closed.

claude · 2026-06-07T10:23:10Z

Claude finished @bootjp's task in 1m 34s —— View job

PR #934 Round-9 Review — R1-F1 doc comment (final cleanup, commit `58db40c`)

Todo list:

Load tools
Read wal_store.go — locate applyHeaderStateOnSkip and verify the doc comment
Verify the comment accurately explains the header-error-before-CRC ordering
Confirm TLA+ spec still clean
Post final review

R1-F1 — Header-error-before-CRC ordering — Closed correctly ✓

The inline comment at wal_store.go:539-551 is technically accurate and well-reasoned:

// Error-ordering contract (claude #934 R1-F1): header parse
// errors surface BEFORE the body-CRC compare runs, so callers
// (the skip-gate fallback in restoreSnapshotState) may observe
// either an ErrSnapshotHeaderUnknownMagic / InvalidLength chain or
// an ErrFSMSnapshotFileCRC chain depending on which check fails
// first. This is the same ordering openAndRestoreFSMSnapshot has
// — both errors are equally fatal for the skip path (they signal
// snapshot file corruption) and both must propagate without ever
// calling ApplySnapshotHeader. The CRC check stays AFTER the
// header parse so the TeeReader has actually been drained before
// we read h.Sum32(); inverting the order would let a CRC mismatch
// surface on a truncated body even when the header was valid,
// muddying the operator-facing diagnostic.

Three claims verified against the code:

Claim	Code	Verified
Header parse errors surface before body-CRC	`readSnapshotHeaderOrDrain` at line 560 returns before `h.Sum32()` at line 565	✓
TeeReader must be drained before `h.Sum32()` is meaningful	`tee := io.TeeReader(io.LimitReader(file, payloadSize), h)` then drained through `ParseSnapshotHeader`/`io.Copy` before CRC read	✓
Same ordering as `openAndRestoreFSMSnapshot`	`restoreAndComputeCRC` in the full-restore path also runs header parse inside the TeeReader loop before the final CRC verify	✓

The function-level docstring (lines 503-517) is also complete and accurate:

"CRC verification still runs [for FSMs without SnapshotHeaderApplier]" — confirmed: readSnapshotHeaderOrDrain with hasSetter=false calls io.Copy(io.Discard, tee) to drain all payload bytes through h, ensuring every byte is hashed regardless of whether the FSM exposes the seam. ✓
"On any verification failure the typed error propagates and FSM state stays untouched." — ApplySnapshotHeader is gated on hasSetter AND only called after both CRC checks pass. ✓

One very minor note (no action required): The attribution (claude #934 R1-F1) in the inline comment technically violates CLAUDE.md's convention ("Don't reference the current task, fix, or callers … since those belong in the PR description and rot as the codebase evolves"). The technical explanation stands without the attribution tag. It's a negligible style point and doesn't affect correctness; mentioning it only for completeness.

TLA+ spec — No divergence ✓

This commit is documentation-only — no code changed. All HLC-4 preconditions, OCC commit-ts ordering, MVCC visibility invariants, and routing-catalog contracts are unaffected. No spec update required.

Complete finding tracker (all rounds)

Round	Finding	Severity	Status
R1-F1	Header-error-before-CRC ordering (doc comment)	Minor	✅ Closed (`58db40c`)
R1-F2 → R3-F2	Skip path coverage completeness	Minor	✅ Superseded by committed-index gate restructure
R4	Apply-floor advance on skip (Codex P1)	Critical	✅ Closed
R4	Observer args (Codex P2 / CodeRabbit Major)	Major	✅ Closed
R4	Emit-after-success (CodeRabbit Major)	Major	✅ Closed
R5	Gate on `hardState.Commit` not WAL tail	Major	✅ Closed
R5	`RestoreExecuted` underflow in gauge	Major	✅ Closed
R6	Test double abs-gap drift	Minor	✅ Closed
R6	Log/metric baseline drift	Minor	✅ Closed
R7	Post-snapshot HLC lease ceiling lost on skip (Codex P1)	Critical	✅ Closed
R7-F1	`SetApplyIndex` with stale index on volatile replay	Minor	✅ Closed
R7-F2	Encrypted-volatile path untested	Minor	✅ Closed
TLA+ spec	No divergence across all rounds	—	✅ None

All findings across 9 rounds are closed. The implementation is correct, the soak-plan metrics are wired end-to-end, the safety invariants are preserved, and the documentation now fully explains the non-obvious ordering decisions. Ready to merge.

github-actions · 2026-06-07T10:23:15Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (58db40c):

kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

chatgpt-codex-connector · 2026-06-07T10:23:23Z

To use Codex here, create a Codex account and connect to github.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 58db40cd6e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T10:25:46Z

 		voterCount:       len(prepared.disk.LocalSnap.Metadata.ConfState.Voters),
 		isLearnerNode:    learnerSetFromConfState(prepared.disk.LocalSnap.Metadata.ConfState),
-		applied:          maxAppliedIndex(prepared.disk.LocalSnap),
+		applied:          coldStartApplied(prepared.disk),


Avoid persisting ahead-of-index snapshots during config replay

When the skip gate seeds e.applied to T for an FSM already ahead of snapshot S, Raft still re-emits committed configuration entries between S and T because newRawNode does not set an applied floor. For a configuration entry at C, applyConfChangeCommitted calls persistConfigState(C, ...); since configIndex is still initialized from S, snapshotPayload(C) serializes the already-T FSM and persists it as a Raft snapshot whose metadata index is only C. Installing that snapshot elsewhere exposes state through T while Raft believes only C is applied, causing subsequent entries C+1..T to be replayed against ahead state. Duplicate configuration entries still need ApplyConfChange, but their config-snapshot persistence must be suppressed or otherwise kept aligned with the FSM state.

Useful? React with 👍 / 👎.

bootjp added 5 commits June 5, 2026 15:42

gemini-code-assist Bot reviewed Jun 5, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 5, 2026

View reviewed changes

bootjp mentioned this pull request Jun 5, 2026

Jepsen Scheduled Stress (Redis): duplicate-elements + future-read + G-single/G2-item-realtime on main #937

Open

chatgpt-codex-connector Bot reviewed Jun 7, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 7, 2026

View reviewed changes

Comment thread internal/raftengine/etcd/wal_store.go Outdated

Comment thread internal/raftengine/etcd/wal_store.go Outdated

chatgpt-codex-connector Bot reviewed Jun 7, 2026

View reviewed changes

bootjp enabled auto-merge June 7, 2026 10:18

chatgpt-codex-connector Bot reviewed Jun 7, 2026

View reviewed changes

bootjp merged commit ca0b4e9 into main Jun 7, 2026
13 checks passed

bootjp deleted the impl/snapshot-skip-b3-skip-gate branch June 7, 2026 10:33

Conversation

bootjp commented Jun 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reading order (6 commits, designed for one-at-a-time review)

What this does

CRC verification (mirrors openAndRestoreFSMSnapshot)

Metrics + INFO log

Design constraints honoured

Test results

What this does NOT do

Soak plan

Refs

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 5, 2026

TLA+ spec divergence review (auto-triggered)

Uh oh!

bootjp commented Jun 5, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jun 5, 2026

Uh oh!

claude Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR #934 Review — Branch 3: Cold-Start Snapshot-Restore Skip Gate

Overall assessment

Finding 1 — ParseSnapshotHeader error short-circuits before the full-body CRC is verified (design trade-off to document)

Finding 2 — have == want boundary case is not tested

Finding 3 — RestoreFallback discards snapIndex; the appliedIndexGap metric emits nothing for fallback — confirm intentional

Finding 4 — TLA+ spec: no divergence

Minor notes (no action required)

Summary

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

TLA+ spec divergence review (auto-triggered)

Uh oh!

claude Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR #934 Round-2 Review — ColdStartObserver factory wiring (commit 320d06e)

Codex P2 — Closed correctly ✓

One minor note (no action required)

Round-1 findings status

Overall

Uh oh!

chatgpt-codex-connector Bot commented Jun 5, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 7, 2026

TLA+ spec divergence review (auto-triggered)

Uh oh!

chatgpt-codex-connector Bot commented Jun 7, 2026

bootjp commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

CRC verification (mirrors `openAndRestoreFSMSnapshot`)

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

claude Bot commented Jun 5, 2026 •

edited

Loading

Finding 1 — `ParseSnapshotHeader` error short-circuits before the full-body CRC is verified (design trade-off to document)

Finding 2 — `have == want` boundary case is not tested

Finding 3 — `RestoreFallback` discards `snapIndex`; the `appliedIndexGap` metric emits nothing for fallback — confirm intentional

claude Bot commented Jun 5, 2026 •

edited

Loading

PR #934 Round-2 Review — ColdStartObserver factory wiring (commit `320d06e`)

claude Bot commented Jun 7, 2026 •

edited

Loading

PR #934 Round-3 Review — Gate cold-start skip on `lastWalIndex` (codex P1, commit `57467ed`)

Finding 1 — `ColdStartObserver` interface docstring stale after semantic change (Minor)

claude Bot commented Jun 7, 2026 •

edited

Loading

PR #934 Round-5 Review — Committed-index gate + executed-gauge underflow fix (commit `1b166a5`)

Finding 1 — `recordingObs.RestoreExecuted` uses `snapIndex - have` (no abs), while the production impl now uses `|snapIndex - have|` — test assertions are technically undefined behaviour

Finding 2 — `gap_ahead` log field uses `have - target` (committed tail) while the metric uses `have - snapIndex` — operators see different baselines

Finding 3 — `coldStartSkipThreshold` `Commit < snapIndex` case: defensive but realistic in one crash window (no action required)

`setApplied` advance-only + conf-change re-delivery — correct ✓

claude Bot commented Jun 7, 2026 •

edited

Loading