feat(snapshot-skip B3): cold-start restoreSnapshotState skip gate + CRC verifier + metrics#934
Conversation
Two-phase interface for the cold-start skip path to preserve the
snapshot header state (HLC ceiling, Stage 8a cutover) the FSM's
Restore would normally apply, without running the multi-GiB body
restore.
- ParseSnapshotHeader(r) — reads the v1/v2 header from a caller-
supplied reader (wrapped in a crc32 TeeReader by the engine)
and drains the rest so the wrapping CRC covers the full
payload. Returns the parsed (ceiling, cutover) WITHOUT mutating
FSM state. Errors propagate; no side-effect on error.
- ApplySnapshotHeader(ceiling, cutover) — pure assignment of the
verified header state. Engine calls this only after Parse
returned and the crc32 hash matched the file footer.
Splitting parse from apply lets the CRC verifier stay co-located
with its private helpers in internal/raftengine/etcd (matching
openAndRestoreFSMSnapshot's safety contract) while the v1/v2 header
parser stays inside the kv package where kv.ReadSnapshotHeader
already lives. Neither package imports the other in production.
Interface only; implementation + skip gate land in subsequent
commits. Refs PR #910 design §5 round-7 + post-B2 implementation
series.
Implements the two-phase opt-in interface added in step 1.
- ParseSnapshotHeader(r) — wraps r in a bufio.Reader, calls
kv.ReadSnapshotHeader (handles v1/v2/headerless/unknown magic),
drains remaining body bytes via io.Copy(io.Discard, br) so the
engine's wrapping crc32 TeeReader covers every payload byte.
Returns the parsed (ceiling, cutover) tuple WITHOUT mutating
f.hlc or f.restoredCutover — that side-effect waits for phase 2
after the engine confirms the CRC.
- ApplySnapshotHeader(ceiling, cutover) — pure assignment of the
verified header state. Mirrors the two side-effects from Restore
that are scoped to the header portion: SetPhysicalCeiling +
restoredCutover.
The drain pass is essential: without it, the engine's TeeReader
would only hash the first ~26 bytes of the body (whatever
ReadSnapshotHeader consumed) and the wrapping CRC check would not
correspond to the file footer. With the drain, the CRC matches
restoreAndComputeCRC's full-payload boundary in
openAndRestoreFSMSnapshot.
Added compile-time guard in fsm_applied_index_iface_check.go:
var _ raftengine.SnapshotHeaderApplier = (*kvFSM)(nil)
so a future rename or signature drift fails 'go build' immediately
rather than silently degrading the skip optimisation.
Tests: go vet ./kv/ + golangci-lint clean. Skip-gate consumer lands
in subsequent commits. Refs PR #910 design §5 round-7 + B3 step 1.
…s 3+4) Wires the user-visible cold-start optimisation from the PR #910 design. After this commit the engine restores the FSM only when strictly necessary; the common case (local restart with a fresh fsm.db) collapses to a header read + CRC verification, skipping the multi-GiB body restore that PR #909's HEALTH_TIMEOUT_SECONDS=300 band-aid was sized to absorb. restoreSnapshotState now branches on fsmAlreadyAtIndex: if fsmAlreadyAtIndex(fsm, tok.Index) { return applyHeaderStateOnSkip(fsm, snapPath, tok.CRC32C) } return openAndRestoreFSMSnapshot(...) // unchanged fsmAlreadyAtIndex returns true ONLY when the FSM exposes raftengine.AppliedIndexReader AND LastAppliedIndex reports (have, true, nil) AND have >= want. Any uncertainty -- FSM doesn't opt in, store reports missing meta key, LastAppliedIndex returns an error -- collapses to false so the caller falls back to the full restore path. Strictly-additive guarantee (PR #910 §4): a wasteful restore is strictly safer than a stale-but-incorrect skip. applyHeaderStateOnSkip mirrors openAndRestoreFSMSnapshot's three- step safety contract exactly: 1. fsmMinFileSize -> ErrFSMSnapshotTooSmall 2. readFSMFooter vs tokenCRC -> ErrFSMSnapshotTokenCRC 3. crc32 TeeReader over the full payload -> ErrFSMSnapshotFileCRC Only after all three pass does the FSM's ApplySnapshotHeader run (pure assignment of ceiling + cutover). FSMs that do not implement raftengine.SnapshotHeaderApplier (legacy test fakes) still get the CRC verification but no header side-effects -- they have no header state to carry forward. Helpers extracted to keep cyclomatic complexity under 10: - verifyFSMSnapshotPrefix: steps 1 + 2 (size + footer-vs-token). - readSnapshotHeaderOrDrain: step 3's seam-vs-no-seam branch (delegates to setter.ParseSnapshotHeader OR drains io.Discard through the tee). Imports gained: 'hash/crc32' and the parent raftengine package (for the AppliedIndexReader + SnapshotHeaderApplier interfaces). Tests: go vet + golangci-lint clean across etcd/, kv/, store/; existing test suite still green (internal/raftengine/etcd 21.5s, kv 9.5s, store 23.9s). Refs PR #910 design §4 + §5 round-7; closes the user-visible cold-start side of the implementation series.
…(B3 step 5)
Wires Prometheus metrics + structured INFO logs around the
restoreSnapshotState skip decision so operators can verify the
optimisation engages in production (design target: >= 90% skip rate
in steady state after B2's soak phase).
internal/raftengine/cold_start.go — new ColdStartObserver
interface with three methods (RestoreSkipped / RestoreExecuted /
RestoreFallback). Lives in the parent raftengine package so
monitoring can implement it without circular import. Contract
documents the strictly-additive fallback reason enum (not_reader /
missing_meta / read_err) and the no-blocking requirement (cold-
start critical path).
internal/raftengine/etcd/engine.go — OpenConfig gains an optional
ColdStartObserver field. nil disables metrics; the skip itself
still runs.
internal/raftengine/etcd/wal_store.go — restoreSnapshotState
signature gains (obs, logger) parameters. The body switches from
the bool fsmAlreadyAtIndex to a coldStartDecision enum
(skip / execute / fallback-{not_reader,missing_meta,read_err})
returned by decideSkipOutcome. reportColdStart dispatches to the
observer + emits a zap.Info log on every cold-start.
- 'restoreSnapshotState skipped' with fsm_applied / snapshot_index
/ gap_ahead. The user-visible perf win logged once per restart.
- 'restoreSnapshotState executed (FSM behind snapshot)' with
gap_behind so operators can see how stale the FSM was.
- 'restoreSnapshotState fallback to full restore' with reason so
operators can chase 'why is the optimisation not engaging' in
production.
monitoring/cold_start.go — newColdStartMetrics + ColdStartObserver
struct. Two series:
elastickv_fsm_cold_start_restore_total{outcome,fallback_reason}
elastickv_fsm_cold_start_applied_index_gap{outcome}
(See doc-comments for the precise label semantics.)
monitoring/registry.go — Registry gains ColdStartObserver()
accessor. main.go (next commit, not in this PR) wires it to the
engine via OpenConfig.
Existing test call sites of restoreSnapshotState
(wal_store_test.go * 8 + grpc_transport_test.go * 1) updated to
pass (nil, nil) — tests opt out of metrics. The legacy bootstrap-
path loadWalState recursion in wal_store.go:522 also passes nil
because OpenConfig wiring has not yet reached that call.
Tests + lint: go vet + golangci-lint clean across all touched
packages; go test ./internal/raftengine/etcd/ ./kv/ ./store/
./monitoring/ -short ok 55s total.
Refs PR #910 design §9. B3 step 5 of the implementation series.
…tep 6)
Nine new tests covering the Branch 3 skip gate end-to-end:
skip-gate behaviour (5):
- TestSkipGate_SkipsWhenFSMFreshEnough — applied >= snap.Index
triggers the skip path; FSM body stays empty; observer records
one RestoreSkipped with gap_ahead; ApplySnapshotHeader fires
with the parsed ceiling.
- TestSkipGate_ExecutesWhenFSMStale — applied < snap.Index runs
the full restore; ApplySnapshotHeader does NOT fire; observer
records one RestoreExecuted with gap_behind.
- TestSkipGate_FallbackMissingMeta — appliedPresent=false falls
back to full restore with reason='missing_meta'.
- TestSkipGate_FallbackReadErr — LastAppliedIndex error collapses
to fallback (NOT propagated) with reason='read_err'.
- TestSkipGate_FallbackNotReader — dummyFSM (no
AppliedIndexReader) falls back with reason='not_reader'.
CRC failure modes (3):
- TestApplyHeaderStateOnSkip_TruncatedFile — empty file triggers
step 1 ErrFSMSnapshotTooSmall; FSM state unchanged.
- TestApplyHeaderStateOnSkip_WrongTokenCRC — bad tokenCRC arg
triggers step 2 ErrFSMSnapshotTokenCRC; FSM state unchanged.
- TestApplyHeaderStateOnSkip_BodyCorruption — XOR one byte of
the body in-place; step 1 + 2 pass but step 3
(full-body CRC) catches it via ErrFSMSnapshotFileCRC; FSM
state unchanged.
kvFSM contract (1):
- TestSkipGate_KVFSMHeaderRoundTrip — production *kv.kvFSM
satisfies SnapshotHeaderApplier; v1 header (EKVTHLC1 +
ceilingMs BE) is parsed and applied; hlc.PhysicalCeiling()
reflects the ceiling; RestoredCutover() is 0 for v1.
Test doubles:
- skipGateFSM implements StateMachine + AppliedIndexReader +
SnapshotHeaderApplier; configurable via field assignment.
- recordingObs implements raftengine.ColdStartObserver; records
every callback for assertion.
Test results:
- go test ./internal/raftengine/etcd/ -run 'TestSkipGate|TestApplyHeaderStateOnSkip' -count=1
-> ok 0.028s
- Full short test sweep across etcd/, kv/, store/, monitoring/
-> ok ~57s total.
- go vet + golangci-lint clean across all touched packages.
Closes B3 implementation series. Refs PR #910 design §4 / §5 / §9
test list.
|
Warning Review limit reached
More reviews will be available in 43 minutes and 8 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR adds a cold-start skip-gate that can avoid full FSM snapshot restores by applying verified snapshot headers when safe, wires a ColdStartObserver through factory/engine/metrics, enforces advance-only applied-index updates, and includes Prometheus metrics plus comprehensive tests for skip/execute/fallback and CRC/header verification. ChangesCold-Start Snapshot Skip-Gate Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (ce277e6):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
@claude review Branch 3 of the PR #910 design — the user-visible cold-start perf win. Follows the merged Branch 2 (PR #915 — metaAppliedIndex plumbing). 6 commits, focused review pathways: Review observation surface:
|
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 4m 16s —— View job PR #934 Review — Branch 3: Cold-Start Snapshot-Restore Skip GateTodo list:
Overall assessmentThe implementation is correct and the core design invariants are faithfully preserved. The three design constraints given in the trigger — strictly-additive fallback (§4), two-phase seam (§5/round-7), and CRC mirroring (§5/round-6) — are all correctly implemented. Four findings below, in descending severity. Finding 1 —
|
There was a problem hiding this comment.
Code Review
This pull request introduces a cold-start snapshot-restore skip-gate optimization. It allows skipping the restoration of a multi-GiB snapshot body if the on-disk state machine (FSM) is already as fresh as or fresher than the snapshot. The changes include the definition of the ColdStartObserver and SnapshotHeaderApplier interfaces, integration into the WAL loading and FSM restore paths, implementation of Prometheus metrics for monitoring skip-gate outcomes, and comprehensive unit tests. There are no review comments, so I have no feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce277e6667
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // nil disables metrics; the skip itself still runs. See | ||
| // docs/design/2026_06_02_idempotent_snapshot_restore.md §9 and | ||
| // internal/raftengine/cold_start.go for the contract. | ||
| ColdStartObserver raftengine.ColdStartObserver |
There was a problem hiding this comment.
Plumb the cold-start observer through the factory
In the production startup path this observer is never populated: main.go creates monitoring.NewRegistry(...), but engines are opened through internal/raftengine/etcd.Factory.Create, and that method builds OpenConfig without a ColdStartObserver field because raftengine.FactoryConfig has no way to carry it. As a result loadWalState always receives obs == nil in normal server runs, so the new cold-start counters/gauges remain empty even when skips/executions/fallbacks happen. Please thread Registry.ColdStartObserver() through the factory config/open call.
Useful? React with 👍 / 👎.
PR #934 round-1 codex P2: in production, monitoring.NewRegistry's ColdStartObserver was never reaching the engine because internal/raftengine/etcd.Factory.Create built OpenConfig solely from the engine-agnostic raftengine.FactoryConfig (which has no place to carry the observer). The Branch 3 cold-start counters/gauges therefore stayed at zero in real runs even when skips/executions/fallbacks happened. Thread it via the engine-specific etcd.FactoryConfig (already holds TickInterval/HeartbeatTick/etc.) so the change is local to the etcd package and main.go — buildShardGroups, buildRuntimeForGroup, and the shared raftengine.FactoryConfig stay untouched. Wiring: - etcd.FactoryConfig gains ColdStartObserver raftengine.ColdStartObserver - Factory.Create copies it into OpenConfig{} - newRaftFactory in main.go now takes the observer arg and passes metricsRegistry.ColdStartObserver() through - moved the factory construction below metricsRegistry creation so the observer is available at factory build time Tests: - 3 existing newRaftFactory callers updated to pass nil (no metrics needed in those test scopes) - TestSkipGate_*/TestApplyHeaderStateOnSkip_*/TestSkipGate_KVFSMHeaderRoundTrip (9 tests) still pass — they exercise restoreSnapshotState directly with a recording observer test double, unaffected by the factory plumbing Verified: go vet ./..., golangci-lint run ./internal/raftengine/etcd/... ./monitoring/... ./internal/raftengine/... ./., and go test -count=1 ./internal/raftengine/etcd/ all clean.
|
@claude review Round-1 codex P2 addressed (commit 320d06e):
Kept the change inside the engine-specific Verified: |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (320d06e):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
Claude finished @bootjp's task in 2m 33s —— View job PR #934 Round-2 Review — ColdStartObserver factory wiring (commit 320d06e)Todo list:
Codex P2 — Closed correctly ✓The full wiring chain is intact:
Test callers: Bootstrap recursive path ( Nil-interface safety: One minor note (no action required)
Round-1 findings status
OverallP2 is closed correctly. The change is minimal and surgical — only |
|
To use Codex here, create a Codex account and connect to github. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 320d06efb8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| decision, have := decideSkipOutcome(fsm, tok.Index) | ||
| reportColdStart(obs, logger, decision, tok.Index, have) | ||
| if decision == coldStartSkip { | ||
| return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C) |
There was a problem hiding this comment.
Prevent skipping when WAL will replay newer entries
When the FSM has applied entries after the latest persisted snapshot (the normal interval between snapshots, since metaAppliedIndex advances on each Apply), LastAppliedIndex() > tok.Index takes this skip path even though Open still initializes the raft applied floor from LocalSnap.Metadata.Index and openAndReadWAL reloads the post-snapshot entries. Those entries are then replayed into a Pebble store that already contains them; transaction replays can hit OCC conflicts before kvFSM observes their commit timestamps, leaving the restarted node’s HLC below timestamps already present on disk. The skip needs to advance/suppress replay through have, or only skip when there are no post-snapshot WAL entries to replay.
Useful? React with 👍 / 👎.
… (codex P1 #934) When the FSM has applied entries past the latest persisted snapshot (the normal interval between snapshots — metaAppliedIndex advances on each Apply), the prior skip path fired on LastAppliedIndex() > tok.Index even though Open still initializes raft from LocalSnap.Metadata.Index and openAndReadWAL replays the post-snapshot WAL entries. Those entries (tok.Index+1 .. have) would re-apply onto a Pebble store that already contains them, hitting OCC conflicts in the transaction commit path before kvFSM observes the commit timestamps, and leaving the HLC ceiling below timestamps already on disk. Fix: reorder loadWalState so the WAL is opened BEFORE the skip-gate decision. Compute lastWalIndex (max entry Index in the WAL tail, or snapshot.Metadata.Index when empty) and gate skip on 'have >= lastWalIndex' instead of 'have >= tok.Index'. When the WAL still carries entries beyond have, the gate falls through to the full-restore path, which then runs through the normal apply path without double-applies. Signature: restoreSnapshotState gains a lastWalIndex uint64 parameter. reportColdStart now takes (snapIndex, target, have) so the structured-log fields capture both 'snapshot_index' (original tok.Index for operator context) and 'last_wal_index' (the threshold actually used), and gap_ahead / gap_behind are computed against the threshold. Caller audit (signature change): - restoreSnapshotState production caller: loadWalState only. Updated. - Test call sites in wal_store_test.go (8), wal_store_skip_gate_test.go (5), grpc_transport_test.go (1) all updated to pass snap.Metadata.Index as the third argument — preserves the original test semantics where the threshold equals the snapshot index. - reportColdStart is private to this file; single call site updated. Regression: TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries - snap.Index=100, fsm.applied=150, lastWalIndex=200; the skip path would previously fire (have>snapIndex) but must now execute because have<lastWalIndex. Verifies fsm.bodyBytes (full restore ran), observer.executed=[lastWalIndex-have], and no skip recorded. Observer interface unchanged — RestoreSkipped/RestoreExecuted still take (snapIndex, have), but the value passed for 'snapIndex' is now the threshold (lastWalIndex). This keeps the metric semantically meaningful as 'gap relative to where the WAL replay would advance to'.
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (57467ed):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
|
@claude review Addresses codex P1 #934 ( Bug: when FSM applied > tok.Index (normal interval between snapshots — metaAppliedIndex advances each Apply), the skip path fired on Fix: reorder Caller audit: Regression: go test + golangci-lint clean on |
|
Claude finished @bootjp's task in 9m 47s —— View job PR #934 Round-3 Review — Gate cold-start skip on
|
| Finding | Severity | Status |
|---|---|---|
F1 — ColdStartObserver docstrings + Prometheus Help stale (snapIndex → WAL tail) |
Minor | Open |
F2 — skip path with non-empty WAL untested (have=lastWalIndex>tok.Index) |
Minor | Open |
| F3 — WAL close on error path | n/a | ✓ Clean |
| TLA+ spec divergence | n/a | ✓ None |
| Codex P1 core fix | — | ✓ Correct |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 57467ed085
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| decision, have := decideSkipOutcome(fsm, lastWalIndex) | ||
| reportColdStart(obs, logger, decision, tok.Index, lastWalIndex, have) | ||
| if decision == coldStartSkip { | ||
| return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C) |
There was a problem hiding this comment.
Advance Raft's applied floor when skipping the restore
When a persisted snapshot is at index S and the FSM has already applied through the WAL tail T, this new condition selects the skip path, but Open still initializes Engine.applied and the RawNode applied floor from LocalSnap.Metadata.Index (S). Consequently Raft emits the committed entries S+1..T again and applyCommitted replays them against the already-advanced Pebble state, which can produce OCC failures or repeat other non-idempotent FSM effects. The new have >= lastWalIndex gate is fresh evidence beyond the earlier review: it guarantees that every post-snapshot WAL entry is already represented in the FSM without adding any corresponding mechanism to suppress its replay.
Useful? React with 👍 / 👎.
| case coldStartSkip: | ||
| if obs != nil { | ||
| obs.RestoreSkipped(target, have) |
There was a problem hiding this comment.
Report snapshot-relative gaps to the observer
Pass snapIndex here rather than target: ColdStartObserver.RestoreSkipped defines its first argument as the snapshot index, and the Prometheus implementation calculates have - snapIndex, but this call supplies lastWalIndex. Whenever the snapshot trails the WAL, the exported gauge therefore reports distance from the WAL tail (often zero) while its help text claims distance from the snapshot pointer; the executed callback below has the same argument mismatch.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
internal/raftengine/etcd/wal_store_skip_gate_test.go (3)
85-95: ⚡ Quick winMisleading parameter names in test double.
The
recordingObsmethods name the first parametersnapIndex, but it receivestarget(lastWalIndex) from the actual implementation (seereportColdStartin wal_store.go:354-386). This is confusing becausetargetandsnapIndexcan differ—when the WAL carries post-snapshot entries,target > snapIndex. The gap calculations happen to work because the tests passsnap.Metadata.IndexaslastWalIndex, making them equal, but the misleading names make the test harder to maintain.♻️ Rename parameters to match the interface contract
-func (o *recordingObs) RestoreSkipped(snapIndex, have uint64) { +func (o *recordingObs) RestoreSkipped(target, have uint64) { o.mu.Lock() defer o.mu.Unlock() - o.skipped = append(o.skipped, have-snapIndex) + o.skipped = append(o.skipped, have-target) } -func (o *recordingObs) RestoreExecuted(snapIndex, have uint64) { +func (o *recordingObs) RestoreExecuted(target, have uint64) { o.mu.Lock() defer o.mu.Unlock() - o.executed = append(o.executed, snapIndex-have) + o.executed = append(o.executed, target-have) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/raftengine/etcd/wal_store_skip_gate_test.go` around lines 85 - 95, Rename the parameters in the recordingObs test double so they reflect the interface contract: change the first parameter name from snapIndex to target (or lastWalIndex) in both RestoreSkipped and RestoreExecuted to match how reportColdStart in wal_store.go calls them; update any internal calculations or variable names in RestoreSkipped and RestoreExecuted to use target and snapIndex appropriately (e.g., compute gaps as have - target or target - have only where semantically correct) so the test double’s signature and names align with the real invocation and avoid confusion.
117-119: 💤 Low valueExtract repeated header length constant.
The header length
16appears in multiple tests (here and line 327) and in the test double (line 59). Consider extracting it as a package constant for maintainability.♻️ Extract constant
Add near the top of the file:
const ( testV1HeaderLen = 16 // EKVTHLC1 (8 bytes) + ceiling BE uint64 (8 bytes) )Then replace the literals:
- payload := make([]byte, 16) + payload := make([]byte, testV1HeaderLen)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/raftengine/etcd/wal_store_skip_gate_test.go` around lines 117 - 119, Extract the repeated header length literal by adding a package-level constant (e.g., testV1HeaderLen = 16) near the top of internal/raftengine/etcd/wal_store_skip_gate_test.go and replace all literal uses (for example the make([]byte, 16) in the payload construction, any copy/slice ops assuming 16 bytes, and the test double referenced around the earlier helper) with that constant; update references such as the payload creation (payload := make([]byte, 16)), any index/slice bounds that assume the 16-byte header, and the test double setup to use testV1HeaderLen so the header size is maintained in one place.
293-315: 💤 Low valueConsider adding clarifying comment for intentional corruption.
The bit flip at line 307 is intentional test corruption, which is clear from context, but a brief comment would make the intent immediately obvious to future readers.
📝 Add clarifying comment
var b [1]byte _, err = f.ReadAt(b[:], 0) require.NoError(t, err) + // Flip a bit to simulate corruption b[0] ^= 0x01 _, err = f.WriteAt(b[:], 0)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/raftengine/etcd/wal_store_skip_gate_test.go` around lines 293 - 315, Add a short clarifying comment immediately above the intentional bit-flip (the line with b[0] ^= 0x01) in TestApplyHeaderStateOnSkip_BodyCorruption to state that the mutation is deliberate test corruption to ensure applyHeaderStateOnSkip detects CRC mismatch; reference the variables crc, path and the expectation ErrFSMSnapshotFileCRC and note that FSM state must not change (fsm.restoredHeader) so reviewers understand the intent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/raftengine/etcd/wal_store.go`:
- Around line 352-369: In reportColdStart, the ColdStartObserver callbacks are
called with the wrong first argument (passing target/lastWalIndex); change the
calls in reportColdStart to pass snapIndex as the first parameter so
RestoreSkipped and RestoreExecuted are invoked as (snapshotIndex,
haveAppliedIndex) — update the obs.RestoreSkipped(...) and
obs.RestoreExecuted(...) invocations in the reportColdStart function
accordingly.
- Around line 290-295: The current code calls reportColdStart before performing
the actual restore/skip, which records metrics even if applyHeaderStateOnSkip or
openAndRestoreFSMSnapshot fail; move the reportColdStart call so it runs only
after the chosen path successfully completes: call decideSkipOutcome(fsm,
lastWalIndex) to get decision/have, then perform either
applyHeaderStateOnSkip(fsm, fsmSnapPath(...), tok.CRC32C) or
openAndRestoreFSMSnapshot(...), check their returned error is nil, and only then
invoke reportColdStart(obs, logger, decision, tok.Index, lastWalIndex, have);
ensure both branches (skip and restore) return/report only on success and
propagate errors unchanged.
---
Nitpick comments:
In `@internal/raftengine/etcd/wal_store_skip_gate_test.go`:
- Around line 85-95: Rename the parameters in the recordingObs test double so
they reflect the interface contract: change the first parameter name from
snapIndex to target (or lastWalIndex) in both RestoreSkipped and RestoreExecuted
to match how reportColdStart in wal_store.go calls them; update any internal
calculations or variable names in RestoreSkipped and RestoreExecuted to use
target and snapIndex appropriately (e.g., compute gaps as have - target or
target - have only where semantically correct) so the test double’s signature
and names align with the real invocation and avoid confusion.
- Around line 117-119: Extract the repeated header length literal by adding a
package-level constant (e.g., testV1HeaderLen = 16) near the top of
internal/raftengine/etcd/wal_store_skip_gate_test.go and replace all literal
uses (for example the make([]byte, 16) in the payload construction, any
copy/slice ops assuming 16 bytes, and the test double referenced around the
earlier helper) with that constant; update references such as the payload
creation (payload := make([]byte, 16)), any index/slice bounds that assume the
16-byte header, and the test double setup to use testV1HeaderLen so the header
size is maintained in one place.
- Around line 293-315: Add a short clarifying comment immediately above the
intentional bit-flip (the line with b[0] ^= 0x01) in
TestApplyHeaderStateOnSkip_BodyCorruption to state that the mutation is
deliberate test corruption to ensure applyHeaderStateOnSkip detects CRC
mismatch; reference the variables crc, path and the expectation
ErrFSMSnapshotFileCRC and note that FSM state must not change
(fsm.restoredHeader) so reviewers understand the intent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ddfbe851-5668-47b2-a0e9-bfcd9c549572
📒 Files selected for processing (16)
internal/raftengine/cold_start.gointernal/raftengine/etcd/engine.gointernal/raftengine/etcd/factory.gointernal/raftengine/etcd/grpc_transport_test.gointernal/raftengine/etcd/wal_store.gointernal/raftengine/etcd/wal_store_skip_gate_test.gointernal/raftengine/etcd/wal_store_test.gointernal/raftengine/statemachine.gokv/fsm.gokv/fsm_applied_index_iface_check.gomain.gomain_bootstrap_e2e_test.gomain_bootstrap_test.gomonitoring/cold_start.gomonitoring/registry.gomultiraft_runtime_test.go
…bit Major #934) Four findings on review 4434778631 + the coderabbit review on the same head. Round-2 fix for the codex P1 plus the three Major / P2 items collected: 1. Codex P1: skip path didn't advance Raft's applied floor. Even with the lastWalIndex gate (which guarantees the FSM has applied through the WAL tail), Engine.Open still seeded e.applied from maxAppliedIndex(LocalSnap) = snapshot.Metadata.Index. Raft then delivered entries snap.Index+1..have to applyCommitted and replayed them onto a Pebble store already containing them - OCC conflicts in the kvFSM commit path; HLC ceiling inversion. Fix: restoreSnapshotState now returns (effectiveApplied uint64, error). diskState carries EffectiveApplied. Engine.Open seeds e.applied and engine.appliedIndex via coldStartApplied(disk) = max(maxAppliedIndex(LocalSnap), EffectiveApplied). Belt+braces: applyNormalCommitted skips entries with Index <= e.applied (idempotent no-op); setApplied is now advance-only so conf changes between snap.Index+1 and have do not regress the counter. 2. Codex P2 / coderabbit Major (observer args): the ColdStartObserver contract is (snapshotIndex, haveAppliedIndex) - monitoring/cold_start.go computes have - snapIndex for the gauge. Round-1 passed target=lastWalIndex instead, breaking the metric baseline. Revert to passing snapIndex. The 'target' / lastWalIndex is preserved as a structured-log field for operator context. 3. Coderabbit Major (emit-after-success): reportColdStart used to fire BEFORE applyHeaderStateOnSkip / openAndRestoreFSMSnapshot, so a CRC/header failure would still register a 'successful' outcome. Moved to after the restore-side work succeeds. Caller audit (signature + semantic): - restoreSnapshotState: now returns (uint64, error). Single production caller (loadWalState) plumbs into diskState.EffectiveApplied. 14 test call sites updated. - setApplied: advance-only. Callers are all in the apply loop (applyNormalCommitted, applyConfChangeCommitted, applyConfChangeV2Committed, applyCommitted's default case). The new gate prevents regression from conf changes between snap.Index+1 and have, which is exactly the case the codex P1 fix relies on. - coldStartApplied: new helper, called from Engine.Open only. Extracted restoreSnapshotStateFromToken to keep restoreSnapshotState under the nestif limit. New regression tests in wal_store_skip_gate_test.go: - TestSkipGate_ReturnsEffectiveAppliedOnSkip: skip path returns have, not snap.Metadata.Index. - TestSkipGate_EmitsAfterSuccess: CRC mismatch does NOT register obs.skipped / obs.executed (metrics emit only on success). go test + golangci-lint clean on ./internal/raftengine/etcd/.
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (c9b7c3b):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
@claude review Addresses all 4 findings on review 4434778631 + the coderabbit review on the same head:
Caller audit: Regression: go test + golangci-lint clean on |
|
To use Codex here, create a Codex account and connect to github. |
…e (codex P2 #934 round 3) Two findings on review 4434778631's c9b7c3b head: 1. Codex P2 (wal_store.go:160) — gate on the committed index. The round-2 fix used entries[n-1].Index (the WAL tail) as the skip threshold, but a follower's WAL can carry an UNCOMMITTED suffix that raft will not surface via Ready.CommittedEntries until the leader confirms it. If the FSM is applied through the committed index but trails that suffix, the gate forced a multi-GiB restore on every restart of any such follower, defeating the cold-start optimization. Fix: extract coldStartSkipThreshold(snapshot, hardState) = max(snapshot.Metadata.Index, hardState.Commit). Drops the walTail term — raft's invariant guarantees Commit ≤ walTail, so the walTail comparison was redundant for safety and harmful for the follower-with-uncommitted-suffix optimization. 2. Codex P2 (monitoring/cold_start.go:80) — avoid underflowing the executed gauge. After the round-1 gate change, an executed restore no longer implies have < snapIndex: a crashed follower can have FSM ahead of the snapshot pointer but behind the committed tail. The previous snapIndex - have subtraction underflowed into ~2^64, corrupting the gauge. Fix: ColdStartObserver.RestoreExecuted computes the absolute distance |snapIndex - have|. Comment updated to explain the new semantics (executed no longer implies stale FSM in the literal sense; it means full restore ran because gate did not skip). Caller audit: - coldStartSkipThreshold: pure helper, called once from loadWalState. No production-side downstream calls. - ColdStartObserver.RestoreExecuted: signature unchanged; behavior change is just the absolute-value computation. Callers (engine cold-start path) pass the same arguments. Regression tests: - TestColdStartSkipThreshold (wal_store_skip_gate_test.go): 4-case table — empty WAL, hardState below snapshot (pre-replay), hardState above snapshot, hardState zero. - TestColdStartObserver_ExecutedAbsoluteGap (new monitoring/cold_start_test.go): 3-case table — FSM behind / FSM ahead / FSM equal. - TestColdStartObserver_SkippedGap: skipped-path baseline. - TestColdStartObserver_NilSafe: nil observer is a no-op. - TestColdStartObserver_FallbackReasonLabel: 3-reason table covering not_reader / missing_meta / read_err. go test + golangci-lint clean on ./monitoring/ + ./internal/raftengine/etcd/.
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (1b166a5):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
@claude review Addresses codex P2 #934 round 3 (review 4434778631 → 2 new inline findings):
Regression tests:
|
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 3m 2s —— View job PR #934 Round-5 Review — Committed-index gate + executed-gauge underflow fix (commit 1b166a5)Todo list:
Overall assessmentBoth fixes are correct. The Finding 1 —
|
| Round | Finding | Severity | Status |
|---|---|---|---|
| R1-F1 | Header-error-before-CRC order doc | Minor | Open (doc-only) |
| R1-F2 | have == want boundary test |
Minor | Superseded by TestColdStartSkipThreshold + committed-index tests |
| R3-F1 | Observer docstrings stale (snapIndex → WAL tail) |
Minor | Closed — args reverted to snapIndex ✓ |
| R3-F2 | Skip path with non-empty WAL untested | Minor | Closed — TestColdStartSkipThreshold + restructured gate covers it ✓ |
| R5-F1 | recordingObs.RestoreExecuted underflow, hides correct abs-gap assertion |
Minor | Open |
| R5-F2 | gap_ahead log uses committed-tail baseline, metric uses snapshot baseline |
Minor | Open |
| Codex P2 #1 | Gate on hardState.Commit not WAL tail |
— | ✓ Correct |
| Codex P2 #2 | RestoreExecuted underflow in gauge |
— | ✓ Correct |
| TLA+ spec | No divergence | — | ✓ None |
The two remaining open items are minor quality issues; neither affects production correctness or the soak-plan metrics.
… (claude #934 round 5) Two findings on the 1b166a5 head: 1. recordingObs.RestoreExecuted (test double) used snapIndex - have, which underflows when have > snapIndex - the exact case the round-3 production fix handled. Tests passed only because both sides wrapped to the same value. Update the test double to mirror monitoring.ColdStartObserver's |snapIndex - have| and tighten TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries from a length-only assertion to the absolute gap (50). 2. The 'gap_ahead' / 'gap_behind' log fields used target (committed tail) as the baseline while the Prometheus gauges use snapIndex. Operators correlating the structured log against the metric saw different magnitudes whenever hardState.Commit > snap.Index (the post-replay normal case). Rename the log fields to be explicit: - last_committed_index (was last_wal_index; the helper now returns max(snap.Index, hardState.Commit), so the new name is more accurate) - gap_ahead_snapshot (skipped path; mirrors metric) - gap_ahead_committed (skipped path; new committed-tail relative) - gap_to_snapshot (executed path; absolute value, no underflow) - gap_behind_committed (executed path; matches gap_behind name semantics) Extracted the executed arm into reportColdStartExecute so reportColdStart stays under the cyclop limit (the abs-value computation pushes the inline branch over). Caller audit: reportColdStartExecute is private to wal_store.go, single call site from reportColdStart. recordingObs.RestoreExecuted is the test double; assertion update was straightforward (1 line). Tests + lint clean on ./internal/raftengine/etcd/ + ./monitoring/.
|
@claude review Addresses both round-5 findings:
Finding 3 (defensive Commit < snapIndex guard) — confirmed correct as-is by claude; no change. go test + golangci-lint clean on |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (d9bf745):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 1m 55s —— View job PR #934 Round-6 Review — recordingObs fix + log/metric baseline disambiguation (commit d9bf745)Todo list:
Overall assessmentBoth round-5 findings are correctly and completely closed. No new issues introduced. The implementation is clean. R5-F1 —
|
| Finding | Severity | Status |
|---|---|---|
| R1-F1 — header-error-before-CRC ordering (doc-only) | Minor | Open (doc comment; no safety impact) |
| R3-F2 — skip path with non-empty WAL untested | Minor | Open (nice-to-have) |
R5-F1 — recordingObs.RestoreExecuted underflow, masks correct abs-gap assertion |
Minor | ✅ Closed (d9bf745) |
R5-F2 — gap_ahead log uses committed-tail baseline, metric uses snapshot baseline |
Minor | ✅ Closed (d9bf745) |
| TLA+ spec divergence | n/a | ✅ None |
The implementation is correct and complete. All major and minor actionable findings across six review rounds are resolved. The two remaining open items are documentation-only (R1-F1) and test-coverage quality (R3-F2); neither affects production correctness, safety, or the soak-plan observability signals. Ready to merge.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9bf745f96
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if entry.Index <= e.applied { | ||
| return nil |
There was a problem hiding this comment.
Preserve post-snapshot HLC lease effects when skipping replay
When the committed tail contains an HLC lease entry after the snapshot followed by a KV mutation that advances metaAppliedIndex, the skip gate seeds e.applied at the tail and this guard drops both entries. ApplySnapshotHeader restores only the older snapshot-time ceiling, so the post-snapshot lease's in-memory SetPhysicalCeiling effect is never reconstructed. If that lease ceiling is still ahead of wall time, the restarted node can retain persisted timestamps above its HLC ceiling and reject fenced timestamp issuance or later issue timestamps from a stale ceiling; duplicate data mutations must be suppressed without suppressing volatile-only entries such as HLC leases.
Useful? React with 👍 / 👎.
| appliedIndexGap: prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
| Name: "elastickv_fsm_cold_start_applied_index_gap", | ||
| Help: "Absolute distance between the snapshot pointer and the FSM's reported applied index at cold start, in Raft log entries. outcome=skipped reports how far ahead the live store was; outcome=executed reports how far behind it was.", | ||
| }, []string{"outcome"}), |
There was a problem hiding this comment.
Partition cold-start metrics by Raft group
In multi-group mode, buildShardGroups opens several engines through one factory carrying the same registry observer, but this gauge is labeled only by outcome. Each shard therefore writes the same time series during startup, and the final value is merely whichever group's callback ran last rather than the gap for an identifiable FSM; the counters likewise aggregate multiple gate decisions despite the documentation describing at most one event per process. Include the Raft group in the observer or metric labels so multi-shard alerts and diagnostics remain meaningful.
Useful? React with 👍 / 👎.
… P1 #934 round 7) When the cold-start skip gate fired the engine seeded `e.applied` at the WAL committed tail and the duplicate guard (`entry.Index <= e.applied { return nil }`) then dropped every replayed entry — including HLC lease entries (raftEncodeHLCLease, tag 0x02). HLC leases have purely in-memory effects (`HLC.SetPhysicalCeiling`), so their post-snapshot ceiling raise was lost on every cold start. `ApplySnapshotHeader` then restored only the older snapshot-time ceiling and the next leader-issued fenced timestamp could collide with persisted commit_ts values stamped under the lost lease. Fix: - Add raftengine.VolatileEntryClassifier — an optional FSM seam that classifies cleartext FSM payloads as volatile-only or data-mutating. - kvFSM.IsVolatileOnlyPayload returns true iff payload[0] == raftEncodeHLCLease. Static iface assertion in fsm_applied_index_iface_check.go. - applyNormalEntry takes a new dropIfNonVolatile flag. On the cold-start duplicate path the engine still decodes the envelope and decrypts (post-cutover HLC leases are encrypted) so the cleartext tag is visible, then either: * passes the payload to fsm.Apply (volatile — replayed for its in-memory monotonic effect, NO setApplied, NO resolveProposal); or * returns (nil, nil) (data-mutating — dropped, preserving the OCC / sidecar idempotency from PR #934 round 1). - applyNormalCommitted forwards `duplicate` to applyNormalEntry and only advances setApplied + resolveProposal on the fresh path. Test: - cold_start_volatile_replay_test.go pins three contracts: (1) duplicate HLC lease → fsm.Apply called, e.applied unchanged; (2) duplicate KV/MVCC → fsm.Apply NOT called, e.applied unchanged; (3) FSM that does not implement the classifier → drop-all (no opt-in regression). - Also pins the fresh-entry asymmetry: every tag reaches fsm.Apply when entry.Index > e.applied. Self-review: 1. Data loss: HLC lease re-apply is monotonic (SetPhysicalCeiling) — no regression, no double-write. The guard for data-mutating entries is unchanged. 2. Concurrency: classifier called inside the existing applyNormalEntry path on the same goroutine — no new locks. 3. Performance: cold-start replay only; one extra envelope decode + (post-cutover) decryption per duplicate. Bounded by WAL tail length; not on a steady-state hot path. 4. Consistency: classifier sees the cleartext payload (post- decrypt), so post-cutover HLC leases are still recognised. Encryption opcodes (0x03..0x07) correctly classified as data- mutating — re-apply would diverge the sidecar RaftAppliedIndex. 5. Test coverage: three table-driven cases pinning every arm (volatile-replay, data-drop, classifier-absent), plus the fresh-entry asymmetry table.
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (20f90a9):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
@claude review Round 7 addresses codex P1: post-snapshot HLC lease effects were lost when the cold-start skip gate fired. Root cause. The Round-1 idempotency guard Fix.
Test.
Plus the fresh-entry asymmetry: every tag still reaches Caller audit. Open items not addressed in this commit.
CI note: previous run hit a 504 on reviewdog install — transient infra, not a |
|
Claude finished @bootjp's task in 3m 42s —— View job PR #934 Round-7 Review — Volatile HLC lease replay on cold-start skip (commit 20f90a9)Todo list:
Overall assessmentThe fix is correct. The root cause analysis is accurate, the seam is well-designed, and the three contracts tested are the right ones. One finding and one minor note below. Finding 1 —
|
| Property | Verification |
|---|---|
| HLC ceiling monotonicity: replay is idempotent | applyHLCLease → SetPhysicalCeiling is a monotonic raise; replaying raises ceiling to at least the post-snapshot level ✓ |
| OCC safety: data-mutating duplicates not replayed | IsVolatileOnlyPayload = false → applyNormalEntry returns (nil, nil) before SetApplyIndex/Apply ✓ |
setApplied not advanced for any duplicate |
applyNormalCommitted:2260 — both volatile and non-volatile duplicates hit if duplicate { return nil } before setApplied ✓ |
resolveProposal not called for volatile replay |
Same arm; no coordinator waiting for HLC lease proposals ✓ |
HaltApply guard runs before duplicate arm |
applyNormalCommitted:2252-2259 — checked on both fresh and duplicate path; HLC lease Apply returns nil so no halt risk ✓ |
| FSMs without classifier get drop-all (no regression) | !ok → return nil, nil in the type-assertion guard; TestApplyNormalCommitted_DuplicateWithoutClassifier_DropsAll pins it ✓ |
| Post-cutover (encrypted) HLC leases: decryption before classification | applyNormalEntry decrypts at line 2370-2391, then classifies at line 2401-2406 ✓ |
| Compile-time guard against interface drift | var _ raftengine.VolatileEntryClassifier = (*kvFSM)(nil) in kv/fsm_applied_index_iface_check.go ✓ |
TLA+ spec — no divergence ✓
kv/fsm.go anchor: IsVolatileOnlyPayload is a pure classification (no state change — explicitly documented). applyHLCLease / SetPhysicalCeiling path is unchanged from previous rounds. The fix ensures the ceiling is raised to at least the post-snapshot level, which is the HLC-4 precondition (i) requirement (bounded skew: newly elected leader's ceiling ≥ the largest ceiling stamped on any committed entry). HLC-4 preconditions (ii) and (iii) are unaffected. No OCC commit-ts, MVCC visibility, or routing-catalog logic is touched. No spec update required.
Round summary
| Round | Finding | Severity | Status |
|---|---|---|---|
| R1-F1 | Header-error-before-CRC ordering (doc-only) | Minor | Open (doc comment) |
| R4–R6 | All major findings (apply-floor advance, observer args, emit-after-success, abs-gap, log baseline) | Major/Minor | ✅ Closed |
| R7-F1 | SetApplyIndex called with stale index on volatile replay |
Minor / informational | Open (no current risk) |
| R7-F2 | Encrypted-volatile path not tested | Minor | Open (nice-to-have) |
| TLA+ spec divergence | No divergence | — | ✅ None |
| Codex P1 #934 round 7 | Post-snapshot HLC lease ceiling lost on skip | — | ✅ Correct |
The codex P1 root cause is correctly diagnosed and the fix is complete. Ready to merge; the two open items are informational and don't affect production correctness or the soak plan.
|
To use Codex here, create a Codex account and connect to github. |
…pted-volatile path (claude #934 round 7 F1+F2) R7-F1 (Minor / informational): on the cold-start duplicate-replay path entry.Index is BELOW e.applied, so calling SetApplyIndex would feed a stale index to any future durability sink (encryption sidecar RaftAppliedIndex, ApplyMutationsRaftAt's metaAppliedIndex bundle). Today's only volatile entry (HLC lease) does not read pendingApplyIdx so this was a no-op, but it future-proofs the seam against a new volatile entry type that also persists. SetApplyIndex now only fires on the !dropIfNonVolatile arm. R7-F2 (Minor / nice-to-have): the round-7 test fixtures all used newTestEngine(fsm, nil, nil) — no cipher, no cutover — so the decryption-before-classification ordering in applyNormalEntry was locked in code but not in tests. Added TestApplyNormalCommitted_VolatileDuplicate_PostCutoverEncrypted which wraps an HLC lease via WrapRaftPayload at an above-cutover index that's still BELOW e.applied. Verifies the engine decrypts, classifies as volatile, replays for the in-memory effect, and hands fsm.Apply the cleartext payload byte-for-byte. While extracting cyclop budget, factored the dropIfNonVolatile classifier arm into Engine.isVolatilePayload. Same semantics; cleaner call site. Self-review: 1. Data loss: no — replay path unchanged, SetApplyIndex suppression has no observable effect for HLC lease (the only current volatile type) but prevents future regression. 2. Concurrency: no new locks, no new goroutines. Same serialised apply loop. 3. Performance: one extra method call per cold-start duplicate. Bounded by WAL tail length; not on a steady-state path. 4. Consistency: encrypted-volatile path now pinned by test — post-cutover HLC lease decrypt → classify → replay is locked against future drift. 5. Test coverage: added explicit encrypted-volatile test; existing 3 contracts (volatile-replay / data-drop / classifier-absent) and fresh-entry asymmetry still pass.
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (fc55c7b):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
@claude review Round 7 F1+F2 follow-up — both Minor findings from your last review are now closed. R7-F1 — R7-F2 — Encrypted-volatile path untested. Added
Refactor note. Pulled the classifier check into Tests + lint clean locally:
R1-F1 (header-error-before-CRC ordering doc comment) is the only |
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 2m 35s —— View job PR #934 Round-8 Review — R7-F1 + R7-F2 follow-up (commit fc55c7b)Todo list:
Overall assessmentBoth R7-F1 and R7-F2 are correctly and completely closed. No new issues. The PR is clean. R7-F1 —
|
dropIfNonVolatile |
IsVolatileOnlyPayload |
SetApplyIndex called? |
fsm.Apply called? |
|---|---|---|---|
false (fresh entry) |
any | ✓ yes, at entry.Index |
✓ yes |
true (duplicate), data-mutating |
false |
✗ returns (nil,nil) early at line 2392 |
✗ |
true (duplicate), volatile |
true |
✗ guarded by !dropIfNonVolatile |
✓ yes, with cleartext payload |
All three paths are correct. The comment explains the future-proofing rationale clearly. The change is a pure no-op for all currently existing volatile entries (only HLC lease, which reads no pendingApplyIdx) but guards the seam against a future volatile type that also persists.
R7-F2 — Encrypted-volatile path untested — Closed correctly ✓
TestApplyNormalCommitted_VolatileDuplicate_PostCutoverEncrypted correctly exercises the full decrypt → classify → replay ordering:
cipher wired, cutover=100, e.applied=200
plain := []byte{0x02, 0, 0, 0, 0, 0, 0, 0, 0x01}
entry := envelopeEntry(t, c, kid, 150, plain)
Execution trace through applyNormalEntry:
decodeProposalEnvelope→ extracts the wrapped (encrypted) payloadentry.Index(150) > cutover(100)→unwrapRaftPayload→payload = plain = [0x02, ...]dropIfNonVolatile(true) && !isVolatilePayload([0x02,...])→isVolatilePayload = true→ does NOT return early!dropIfNonVolatile = false→SetApplyIndexsuppressedfsm.Apply([0x02,...])with cleartext
Three assertions:
fsm.calls = 1— decryption ran before classification, lease was replayed ✓fsm.lastPayload == plain— classifier received post-decrypt bytes ✓e.applied == 200— not advanced on duplicate arm ✓
This is precisely the scenario R7-F2 flagged. The test is correctly structured and locks in the ordering that protects against future regressions.
isVolatilePayload extraction — Clean ✓
The extraction is a pure refactor:
- Before: inline type assertion
+ okcheck inapplyNormalEntry(would push past cyclop budget with the additionalif !dropIfNonVolatilebranch) - After: extracted to
Engine.isVolatilePayload, identical semantics
The method comment accurately documents the "runs AFTER decryption" contract — critical for anyone reading the method in isolation. The !ok → false default is correctly stated as "preserving pre-PR semantics for FSMs that haven't opted in." ✓
TLA+ spec — No divergence ✓
This commit touches only engine.go and cold_start_volatile_replay_test.go. The kv/fsm.go anchor file (which triggers the TLA+ check on every push) is unchanged from the previous commits. As confirmed in Round-7:
applyHLCLease/SetPhysicalCeilingpath: unchanged, HLC-4 preconditions (i)/(ii)/(iii) unaffectedApplySnapshotHeader: unchanged- No OCC commit-ts, MVCC visibility, or routing-catalog changes
No spec update required.
Complete finding tracker (all rounds)
| Round | Finding | Severity | Status |
|---|---|---|---|
| R1-F1 | Header-error-before-CRC ordering (doc comment) | Minor | Open (doc-only, no safety impact) |
| R1-F2 → R3-F2 | Skip path coverage completeness | Minor | ✅ Superseded by committed-index gate restructure |
| R3-F1 → R6 | Observer docstrings / log field alignment | Minor | ✅ Closed |
| R4 | Apply-floor advance on skip (Codex P1) | Critical | ✅ Closed |
| R4 | Observer args (Codex P2 / CodeRabbit Major) | Major | ✅ Closed |
| R4 | Emit-after-success (CodeRabbit Major) | Major | ✅ Closed |
| R5 | Gate on hardState.Commit not WAL tail |
Major | ✅ Closed |
| R5 | RestoreExecuted underflow in gauge |
Major | ✅ Closed |
| R6 | Test double abs-gap drift | Minor | ✅ Closed |
| R7 | Post-snapshot HLC lease ceiling lost on skip (Codex P1) | Critical | ✅ Closed |
| R7-F1 | SetApplyIndex with stale index on volatile replay |
Minor | ✅ Closed (fc55c7b) |
| R7-F2 | Encrypted-volatile path untested | Minor | ✅ Closed (fc55c7b) |
| TLA+ spec | No divergence across all rounds | — | ✅ None |
The sole remaining open item is R1-F1 — a doc comment explaining the header-parse-before-body-CRC ordering in applyHeaderStateOnSkip. You confirmed this as doc-only with no safety impact. If you'd like to add it before merging, the location is wal_store.go at the applyHeaderStateOnSkip function body. Otherwise the PR is ready to merge.
…tateOnSkip (claude #934 R1-F1) R1-F1 was the only open finding across 8 review rounds, marked doc- only with no safety impact. Add a comment in applyHeaderStateOnSkip spelling out why header parse runs before the body-CRC compare: - Both header-parse errors and CRC errors are equally fatal for the skip path (snapshot file corruption) and both must propagate without ever calling ApplySnapshotHeader. - CRC compare stays AFTER the header parse so the TeeReader has actually been drained when h.Sum32() is read; inverting the order would let a CRC mismatch surface on a truncated body even when the header was valid, muddying the operator-facing diagnostic. - Matches the same ordering openAndRestoreFSMSnapshot uses, so fallback paths see consistent error chains regardless of which routine ran. No code change; this is the only safe artifact of an audit comment. Tests pass; lint clean.
|
@claude review Final cleanup: R1-F1 doc comment in |
|
Claude finished @bootjp's task in 1m 34s —— View job PR #934 Round-9 Review — R1-F1 doc comment (final cleanup, commit 58db40c)Todo list:
R1-F1 — Header-error-before-CRC ordering — Closed correctly ✓The inline comment at // Error-ordering contract (claude #934 R1-F1): header parse
// errors surface BEFORE the body-CRC compare runs, so callers
// (the skip-gate fallback in restoreSnapshotState) may observe
// either an ErrSnapshotHeaderUnknownMagic / InvalidLength chain or
// an ErrFSMSnapshotFileCRC chain depending on which check fails
// first. This is the same ordering openAndRestoreFSMSnapshot has
// — both errors are equally fatal for the skip path (they signal
// snapshot file corruption) and both must propagate without ever
// calling ApplySnapshotHeader. The CRC check stays AFTER the
// header parse so the TeeReader has actually been drained before
// we read h.Sum32(); inverting the order would let a CRC mismatch
// surface on a truncated body even when the header was valid,
// muddying the operator-facing diagnostic.Three claims verified against the code:
The function-level docstring (lines 503-517) is also complete and accurate:
One very minor note (no action required): The attribution TLA+ spec — No divergence ✓This commit is documentation-only — no code changed. All HLC-4 preconditions, OCC commit-ts ordering, MVCC visibility invariants, and routing-catalog contracts are unaffected. No spec update required. Complete finding tracker (all rounds)
All findings across 9 rounds are closed. The implementation is correct, the soak-plan metrics are wired end-to-end, the safety invariants are preserved, and the documentation now fully explains the non-obvious ordering decisions. Ready to merge. |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (58db40c):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 58db40cd6e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| voterCount: len(prepared.disk.LocalSnap.Metadata.ConfState.Voters), | ||
| isLearnerNode: learnerSetFromConfState(prepared.disk.LocalSnap.Metadata.ConfState), | ||
| applied: maxAppliedIndex(prepared.disk.LocalSnap), | ||
| applied: coldStartApplied(prepared.disk), |
There was a problem hiding this comment.
Avoid persisting ahead-of-index snapshots during config replay
When the skip gate seeds e.applied to T for an FSM already ahead of snapshot S, Raft still re-emits committed configuration entries between S and T because newRawNode does not set an applied floor. For a configuration entry at C, applyConfChangeCommitted calls persistConfigState(C, ...); since configIndex is still initialized from S, snapshotPayload(C) serializes the already-T FSM and persists it as a Raft snapshot whose metadata index is only C. Installing that snapshot elsewhere exposes state through T while Raft believes only C is applied, causing subsequent entries C+1..T to be replayed against ahead state. Duplicate configuration entries still need ApplyConfChange, but their config-snapshot persistence must be suppressed or otherwise kept aligned with the FSM state.
Useful? React with 👍 / 👎.
Summary
Implements Branch 3 of the cold-start snapshot-restore skip optimisation designed in PR #910, building on PR #915 (B2 —
metaAppliedIndexplumbing). This is the user-visible perf win of the series: after this lands, the engine restores the FSM only when the on-disk state is actually behind the snapshot pointer; the common case (local restart with a fresh fsm.db) collapses to a header read + CRC verification, skipping the multi-GiB body restore that PR #909'sHEALTH_TIMEOUT_SECONDS=300band-aid was sized to absorb.Reading order (6 commits, designed for one-at-a-time review)
8312f4baraftengine.SnapshotHeaderAppliertwo-phase interface (ParseSnapshotHeader+ApplySnapshotHeader)024eb43ckvFSMimplements both phases + compile-time guard933dab86applyHeaderStateOnSkip(3-step CRC, mirrorsopenAndRestoreFSMSnapshot) +restoreSnapshotStateskip gatefsmAlreadyAtIndex+ skip-gate wiringraftengine.ColdStartObserver+monitoring.ColdStartMetrics+Registry.ColdStartObserver()ce277e66(Commit 5 was merged into the metrics commit; the actual git log is 5 commits.)
What this does
restoreSnapshotStatebranches ondecideSkipOutcome:decideSkipOutcomereturns one of 5 outcomes:skipLastAppliedIndex >= snap.IndexexecuteLastAppliedIndex < snap.Indexfallback_not_readerAppliedIndexReaderfallback_missing_metafallback_read_errLastAppliedIndexreturned an errorThe three fallback paths preserve the strictly-additive guarantee from the design doc §4: over-restoring is always strictly safer than skipping incorrectly. Errors from
LastAppliedIndexdo not abort cold start — they collapse to fallback.CRC verification (mirrors
openAndRestoreFSMSnapshot)applyHeaderStateOnSkipruns the same three-step safety contract as the full restore path. Failure of any step propagates a typed error WITHOUT mutating FSM state:info.Size() < fsmMinFileSizeErrFSMSnapshotTooSmallreadFSMFootervstokenCRCErrFSMSnapshotTokenCRCcrc32.TeeReadervs footerErrFSMSnapshotFileCRCThe FSM's
ApplySnapshotHeader(HLC ceiling + Stage 8a cutover) only fires after all three pass.Metrics + INFO log
raftengine.ColdStartObserverinterface lives in the parentraftenginepackage so monitoring can implement it without circular import.monitoring.RegistrygainsColdStartObserver(). The engine receives it throughOpenConfig.ColdStartObserver— nil disables metrics, the skip itself still runs.Two Prometheus series:
Plus a structured zap.Info log on every cold-start with fsm_applied / snapshot_index / gap and (for fallback) the reason enum.
Design constraints honoured
decideSkipOutcomecollapses every uncertainty to a fallback variant;LastAppliedIndexerrors do NOT abort cold start.ParseSnapshotHeaderreads the v1/v2 header from a caller-supplied reader and drains the rest;ApplySnapshotHeaderis pure assignment. Splits the parse from the side-effect so the engine can verify the CRC between them.openAndRestoreFSMSnapshot. The full-body CRC pass IS the slow part of this PR (~6s for a 6 GiB FSM at 1 GiB/s SSD read) — but still strictly cheaper than the restore it replaces, which also reads the file once AND writes a temp Pebble database viarestoreBatchLoopInto.not_reporterlabel is namednot_readerin the actual impl (matching the round-5 PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 rename).Engine.applySnapshot(engine.go:1641 InstallSnapshot hot path) is untouched. The TODO sentinel added in PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still names B3 as the follow-up site for the peer-after-InstallSnapshot bump — that's not in scope here either.Test results
What this does NOT do
HEALTH_TIMEOUT_SECONDS=300. That's Branch 4, gated on production data showing steady-state skip rate ≥ 90%.Engine.applySnapshotto populatemetaAppliedIndexfor peers receiving InstallSnapshot. The TODO sentinel from PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still flags this; addressing it requires a separate design pass.ELASTICKV_RAFT_SNAPSHOT_COUNT=10end-to-end). The unit tests cover the gate decision matrix; an integration test would prove the codex round-3 P2 scenario is closed end-to-end in production-like setup, which is most valuable as a follow-up after B3 has soaked.Soak plan
Same as B2: deploy and observe
elastickv_fsm_cold_start_restore_total{outcome="skipped"}ramp up across cold starts. The design target is steady-state skip rate ≥ 90%; the per-outcome label distribution and gap distribution are the primary signals.After one release of soak, Branch 4 can lower
HEALTH_TIMEOUT_SECONDS.Refs
metaAppliedIndexbeing populatedHEALTH_TIMEOUT_SECONDSband-aid this series eventually obviatesSummary by CodeRabbit
New Features
Tests