Skip to content

feat(snapshot-skip B3): cold-start restoreSnapshotState skip gate + CRC verifier + metrics#934

Merged
bootjp merged 13 commits into
mainfrom
impl/snapshot-skip-b3-skip-gate
Jun 7, 2026
Merged

feat(snapshot-skip B3): cold-start restoreSnapshotState skip gate + CRC verifier + metrics#934
bootjp merged 13 commits into
mainfrom
impl/snapshot-skip-b3-skip-gate

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Jun 5, 2026

Summary

Implements Branch 3 of the cold-start snapshot-restore skip optimisation designed in PR #910, building on PR #915 (B2 — metaAppliedIndex plumbing). This is the user-visible perf win of the series: after this lands, the engine restores the FSM only when the on-disk state is actually behind the snapshot pointer; the common case (local restart with a fresh fsm.db) collapses to a header read + CRC verification, skipping the multi-GiB body restore that PR #909's HEALTH_TIMEOUT_SECONDS=300 band-aid was sized to absorb.

Reading order (6 commits, designed for one-at-a-time review)

# commit scope
1 8312f4ba raftengine.SnapshotHeaderApplier two-phase interface (ParseSnapshotHeader + ApplySnapshotHeader)
2 024eb43c kvFSM implements both phases + compile-time guard
3 933dab86 applyHeaderStateOnSkip (3-step CRC, mirrors openAndRestoreFSMSnapshot) + restoreSnapshotState skip gate
4 (folded into #3) fsmAlreadyAtIndex + skip-gate wiring
5 (next) metrics + INFO log: raftengine.ColdStartObserver + monitoring.ColdStartMetrics + Registry.ColdStartObserver()
6 ce277e66 9 new tests (skip-gate behaviour ×5 + CRC failure modes ×3 + kvFSM contract ×1)

(Commit 5 was merged into the metrics commit; the actual git log is 5 commits.)

What this does

restoreSnapshotState branches on decideSkipOutcome:

decision, have := decideSkipOutcome(fsm, tok.Index)
reportColdStart(obs, logger, decision, tok.Index, have)
if decision == coldStartSkip {
    return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C)
}
return openAndRestoreFSMSnapshot(...) // unchanged

decideSkipOutcome returns one of 5 outcomes:

outcome trigger action
skip LastAppliedIndex >= snap.Index header-only path
execute LastAppliedIndex < snap.Index full restore
fallback_not_reader FSM doesn't implement AppliedIndexReader full restore
fallback_missing_meta meta key absent (pre-upgrade fsm.db) full restore
fallback_read_err LastAppliedIndex returned an error full restore

The three fallback paths preserve the strictly-additive guarantee from the design doc §4: over-restoring is always strictly safer than skipping incorrectly. Errors from LastAppliedIndex do not abort cold start — they collapse to fallback.

CRC verification (mirrors openAndRestoreFSMSnapshot)

applyHeaderStateOnSkip runs the same three-step safety contract as the full restore path. Failure of any step propagates a typed error WITHOUT mutating FSM state:

Step Check Error
1 info.Size() < fsmMinFileSize ErrFSMSnapshotTooSmall
2 readFSMFooter vs tokenCRC ErrFSMSnapshotTokenCRC
3 full-body crc32.TeeReader vs footer ErrFSMSnapshotFileCRC

The FSM's ApplySnapshotHeader (HLC ceiling + Stage 8a cutover) only fires after all three pass.

Metrics + INFO log

raftengine.ColdStartObserver interface lives in the parent raftengine package so monitoring can implement it without circular import. monitoring.Registry gains ColdStartObserver(). The engine receives it through OpenConfig.ColdStartObserver — nil disables metrics, the skip itself still runs.

Two Prometheus series:

elastickv_fsm_cold_start_restore_total{outcome,fallback_reason}
elastickv_fsm_cold_start_applied_index_gap{outcome}

Plus a structured zap.Info log on every cold-start with fsm_applied / snapshot_index / gap and (for fallback) the reason enum.

Design constraints honoured

  • §4 strictly-additive fallback: decideSkipOutcome collapses every uncertainty to a fallback variant; LastAppliedIndex errors do NOT abort cold start.
  • §5 two-phase seam (round-7): ParseSnapshotHeader reads the v1/v2 header from a caller-supplied reader and drains the rest; ApplySnapshotHeader is pure assignment. Splits the parse from the side-effect so the engine can verify the CRC between them.
  • §5 CRC mirroring (round-6): 3-step verification (size + footer-vs-tokenCRC + full-body-CRC) exactly matches openAndRestoreFSMSnapshot. The full-body CRC pass IS the slow part of this PR (~6s for a 6 GiB FSM at 1 GiB/s SSD read) — but still strictly cheaper than the restore it replaces, which also reads the file once AND writes a temp Pebble database via restoreBatchLoopInto.
  • §9 observability: three outcomes + gap label; the design's not_reporter label is named not_reader in the actual impl (matching the round-5 PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 rename).
  • Non-Goals respected: Engine.applySnapshot (engine.go:1641 InstallSnapshot hot path) is untouched. The TODO sentinel added in PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still names B3 as the follow-up site for the peer-after-InstallSnapshot bump — that's not in scope here either.

Test results

go vet ./internal/raftengine/etcd/ ./internal/raftengine/ ./kv/ ./store/ ./monitoring/   → exit 0
golangci-lint run ... (same packages)                                                     → 0 issues
go test ./internal/raftengine/etcd/ ./kv/ ./store/ ./monitoring/ -short                   → ok ~57s total
go test ./internal/raftengine/etcd/ -run 'TestSkipGate|TestApplyHeaderStateOnSkip'        → ok 0.028s

What this does NOT do

  • Does NOT change HEALTH_TIMEOUT_SECONDS=300. That's Branch 4, gated on production data showing steady-state skip rate ≥ 90%.
  • Does NOT wire Engine.applySnapshot to populate metaAppliedIndex for peers receiving InstallSnapshot. The TODO sentinel from PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still flags this; addressing it requires a separate design pass.
  • Does NOT add an idle-cluster integration test (ELASTICKV_RAFT_SNAPSHOT_COUNT=10 end-to-end). The unit tests cover the gate decision matrix; an integration test would prove the codex round-3 P2 scenario is closed end-to-end in production-like setup, which is most valuable as a follow-up after B3 has soaked.

Soak plan

Same as B2: deploy and observe elastickv_fsm_cold_start_restore_total{outcome="skipped"} ramp up across cold starts. The design target is steady-state skip rate ≥ 90%; the per-outcome label distribution and gap distribution are the primary signals.

After one release of soak, Branch 4 can lower HEALTH_TIMEOUT_SECONDS.

Refs

Summary by CodeRabbit

  • New Features

    • Cold-start lifecycle callbacks added and wired through the raft engine to report skip/execute/fallback outcomes and applied-index gaps.
    • Prometheus metrics for cold-start outcomes and applied-index gap.
    • Snapshot-header read/apply contract introduced; KV FSM implements header apply and volatile-entry classification to support header-only skips and volatile-only replay on duplicates.
  • Tests

    • Extensive unit/integration tests covering skip/execute/fallback decisions, header verification, volatile duplicate replay, and updated restore-call signatures.

bootjp added 5 commits June 5, 2026 15:42
Two-phase interface for the cold-start skip path to preserve the
snapshot header state (HLC ceiling, Stage 8a cutover) the FSM's
Restore would normally apply, without running the multi-GiB body
restore.

  - ParseSnapshotHeader(r) — reads the v1/v2 header from a caller-
    supplied reader (wrapped in a crc32 TeeReader by the engine)
    and drains the rest so the wrapping CRC covers the full
    payload. Returns the parsed (ceiling, cutover) WITHOUT mutating
    FSM state. Errors propagate; no side-effect on error.
  - ApplySnapshotHeader(ceiling, cutover) — pure assignment of the
    verified header state. Engine calls this only after Parse
    returned and the crc32 hash matched the file footer.

Splitting parse from apply lets the CRC verifier stay co-located
with its private helpers in internal/raftengine/etcd (matching
openAndRestoreFSMSnapshot's safety contract) while the v1/v2 header
parser stays inside the kv package where kv.ReadSnapshotHeader
already lives. Neither package imports the other in production.

Interface only; implementation + skip gate land in subsequent
commits. Refs PR #910 design §5 round-7 + post-B2 implementation
series.
Implements the two-phase opt-in interface added in step 1.

  - ParseSnapshotHeader(r) — wraps r in a bufio.Reader, calls
    kv.ReadSnapshotHeader (handles v1/v2/headerless/unknown magic),
    drains remaining body bytes via io.Copy(io.Discard, br) so the
    engine's wrapping crc32 TeeReader covers every payload byte.
    Returns the parsed (ceiling, cutover) tuple WITHOUT mutating
    f.hlc or f.restoredCutover — that side-effect waits for phase 2
    after the engine confirms the CRC.

  - ApplySnapshotHeader(ceiling, cutover) — pure assignment of the
    verified header state. Mirrors the two side-effects from Restore
    that are scoped to the header portion: SetPhysicalCeiling +
    restoredCutover.

The drain pass is essential: without it, the engine's TeeReader
would only hash the first ~26 bytes of the body (whatever
ReadSnapshotHeader consumed) and the wrapping CRC check would not
correspond to the file footer. With the drain, the CRC matches
restoreAndComputeCRC's full-payload boundary in
openAndRestoreFSMSnapshot.

Added compile-time guard in fsm_applied_index_iface_check.go:

  var _ raftengine.SnapshotHeaderApplier = (*kvFSM)(nil)

so a future rename or signature drift fails 'go build' immediately
rather than silently degrading the skip optimisation.

Tests: go vet ./kv/ + golangci-lint clean. Skip-gate consumer lands
in subsequent commits. Refs PR #910 design §5 round-7 + B3 step 1.
…s 3+4)

Wires the user-visible cold-start optimisation from the PR #910
design. After this commit the engine restores the FSM only when
strictly necessary; the common case (local restart with a fresh
fsm.db) collapses to a header read + CRC verification, skipping
the multi-GiB body restore that PR #909's HEALTH_TIMEOUT_SECONDS=300
band-aid was sized to absorb.

restoreSnapshotState now branches on fsmAlreadyAtIndex:

  if fsmAlreadyAtIndex(fsm, tok.Index) {
      return applyHeaderStateOnSkip(fsm, snapPath, tok.CRC32C)
  }
  return openAndRestoreFSMSnapshot(...) // unchanged

fsmAlreadyAtIndex returns true ONLY when the FSM exposes
raftengine.AppliedIndexReader AND LastAppliedIndex reports
(have, true, nil) AND have >= want. Any uncertainty -- FSM doesn't
opt in, store reports missing meta key, LastAppliedIndex returns an
error -- collapses to false so the caller falls back to the full
restore path. Strictly-additive guarantee (PR #910 §4): a wasteful
restore is strictly safer than a stale-but-incorrect skip.

applyHeaderStateOnSkip mirrors openAndRestoreFSMSnapshot's three-
step safety contract exactly:

  1. fsmMinFileSize  -> ErrFSMSnapshotTooSmall
  2. readFSMFooter vs tokenCRC -> ErrFSMSnapshotTokenCRC
  3. crc32 TeeReader over the full payload -> ErrFSMSnapshotFileCRC

Only after all three pass does the FSM's ApplySnapshotHeader run
(pure assignment of ceiling + cutover). FSMs that do not implement
raftengine.SnapshotHeaderApplier (legacy test fakes) still get the
CRC verification but no header side-effects -- they have no header
state to carry forward.

Helpers extracted to keep cyclomatic complexity under 10:

  - verifyFSMSnapshotPrefix: steps 1 + 2 (size + footer-vs-token).
  - readSnapshotHeaderOrDrain: step 3's seam-vs-no-seam branch
    (delegates to setter.ParseSnapshotHeader OR drains
    io.Discard through the tee).

Imports gained: 'hash/crc32' and the parent raftengine package
(for the AppliedIndexReader + SnapshotHeaderApplier interfaces).

Tests: go vet + golangci-lint clean across etcd/, kv/, store/;
existing test suite still green
(internal/raftengine/etcd 21.5s, kv 9.5s, store 23.9s).

Refs PR #910 design §4 + §5 round-7; closes the user-visible
cold-start side of the implementation series.
…(B3 step 5)

Wires Prometheus metrics + structured INFO logs around the
restoreSnapshotState skip decision so operators can verify the
optimisation engages in production (design target: >= 90% skip rate
in steady state after B2's soak phase).

internal/raftengine/cold_start.go — new ColdStartObserver
interface with three methods (RestoreSkipped / RestoreExecuted /
RestoreFallback). Lives in the parent raftengine package so
monitoring can implement it without circular import. Contract
documents the strictly-additive fallback reason enum (not_reader /
missing_meta / read_err) and the no-blocking requirement (cold-
start critical path).

internal/raftengine/etcd/engine.go — OpenConfig gains an optional
ColdStartObserver field. nil disables metrics; the skip itself
still runs.

internal/raftengine/etcd/wal_store.go — restoreSnapshotState
signature gains (obs, logger) parameters. The body switches from
the bool fsmAlreadyAtIndex to a coldStartDecision enum
(skip / execute / fallback-{not_reader,missing_meta,read_err})
returned by decideSkipOutcome. reportColdStart dispatches to the
observer + emits a zap.Info log on every cold-start.

  - 'restoreSnapshotState skipped' with fsm_applied / snapshot_index
    / gap_ahead. The user-visible perf win logged once per restart.
  - 'restoreSnapshotState executed (FSM behind snapshot)' with
    gap_behind so operators can see how stale the FSM was.
  - 'restoreSnapshotState fallback to full restore' with reason so
    operators can chase 'why is the optimisation not engaging' in
    production.

monitoring/cold_start.go — newColdStartMetrics + ColdStartObserver
struct. Two series:

  elastickv_fsm_cold_start_restore_total{outcome,fallback_reason}
  elastickv_fsm_cold_start_applied_index_gap{outcome}

(See doc-comments for the precise label semantics.)

monitoring/registry.go — Registry gains ColdStartObserver()
accessor. main.go (next commit, not in this PR) wires it to the
engine via OpenConfig.

Existing test call sites of restoreSnapshotState
(wal_store_test.go * 8 + grpc_transport_test.go * 1) updated to
pass (nil, nil) — tests opt out of metrics. The legacy bootstrap-
path loadWalState recursion in wal_store.go:522 also passes nil
because OpenConfig wiring has not yet reached that call.

Tests + lint: go vet + golangci-lint clean across all touched
packages; go test ./internal/raftengine/etcd/ ./kv/ ./store/
./monitoring/ -short ok 55s total.

Refs PR #910 design §9. B3 step 5 of the implementation series.
…tep 6)

Nine new tests covering the Branch 3 skip gate end-to-end:

skip-gate behaviour (5):
  - TestSkipGate_SkipsWhenFSMFreshEnough — applied >= snap.Index
    triggers the skip path; FSM body stays empty; observer records
    one RestoreSkipped with gap_ahead; ApplySnapshotHeader fires
    with the parsed ceiling.
  - TestSkipGate_ExecutesWhenFSMStale — applied < snap.Index runs
    the full restore; ApplySnapshotHeader does NOT fire; observer
    records one RestoreExecuted with gap_behind.
  - TestSkipGate_FallbackMissingMeta — appliedPresent=false falls
    back to full restore with reason='missing_meta'.
  - TestSkipGate_FallbackReadErr — LastAppliedIndex error collapses
    to fallback (NOT propagated) with reason='read_err'.
  - TestSkipGate_FallbackNotReader — dummyFSM (no
    AppliedIndexReader) falls back with reason='not_reader'.

CRC failure modes (3):
  - TestApplyHeaderStateOnSkip_TruncatedFile — empty file triggers
    step 1 ErrFSMSnapshotTooSmall; FSM state unchanged.
  - TestApplyHeaderStateOnSkip_WrongTokenCRC — bad tokenCRC arg
    triggers step 2 ErrFSMSnapshotTokenCRC; FSM state unchanged.
  - TestApplyHeaderStateOnSkip_BodyCorruption — XOR one byte of
    the body in-place; step 1 + 2 pass but step 3
    (full-body CRC) catches it via ErrFSMSnapshotFileCRC; FSM
    state unchanged.

kvFSM contract (1):
  - TestSkipGate_KVFSMHeaderRoundTrip — production *kv.kvFSM
    satisfies SnapshotHeaderApplier; v1 header (EKVTHLC1 +
    ceilingMs BE) is parsed and applied; hlc.PhysicalCeiling()
    reflects the ceiling; RestoredCutover() is 0 for v1.

Test doubles:
  - skipGateFSM implements StateMachine + AppliedIndexReader +
    SnapshotHeaderApplier; configurable via field assignment.
  - recordingObs implements raftengine.ColdStartObserver; records
    every callback for assertion.

Test results:
  - go test ./internal/raftengine/etcd/ -run 'TestSkipGate|TestApplyHeaderStateOnSkip' -count=1
    -> ok 0.028s
  - Full short test sweep across etcd/, kv/, store/, monitoring/
    -> ok ~57s total.
  - go vet + golangci-lint clean across all touched packages.

Closes B3 implementation series. Refs PR #910 design §4 / §5 / §9
test list.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

Review Change Stack

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 43 minutes and 8 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d8b6e9eb-7059-4241-8cc6-060f5f4eadfa

📥 Commits

Reviewing files that changed from the base of the PR and between 20f90a9 and 58db40c.

📒 Files selected for processing (3)
  • internal/raftengine/etcd/cold_start_volatile_replay_test.go
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/wal_store.go
📝 Walkthrough

Walkthrough

This PR adds a cold-start skip-gate that can avoid full FSM snapshot restores by applying verified snapshot headers when safe, wires a ColdStartObserver through factory/engine/metrics, enforces advance-only applied-index updates, and includes Prometheus metrics plus comprehensive tests for skip/execute/fallback and CRC/header verification.

Changes

Cold-Start Snapshot Skip-Gate Feature

Layer / File(s) Summary
Contract definitions: ColdStartObserver and SnapshotHeaderApplier
internal/raftengine/cold_start.go, internal/raftengine/statemachine.go
ColdStartObserver declares RestoreSkipped/RestoreExecuted/RestoreFallback; SnapshotHeaderApplier and VolatileEntryClassifier are added as optional FSM seams.
kvFSM header parse/apply and volatile classifier
kv/fsm.go, kv/fsm_applied_index_iface_check.go
kvFSM implements ParseSnapshotHeader/ApplySnapshotHeader and IsVolatileOnlyPayload; compile-time assertions ensure interface conformance.
Factory and engine config wiring
internal/raftengine/etcd/factory.go, internal/raftengine/etcd/engine.go
Adds ColdStartObserver fields to FactoryConfig/OpenConfig and threads the observer into engine Open during factory creation.
Cold-start metrics and registry integration
monitoring/cold_start.go, monitoring/registry.go
Adds ColdStartMetrics (outcome counter + applied-index gap gauge) and a ColdStartObserver implementation; Registry constructs and exposes the observer via ColdStartObserver().
Application startup and test integration
main.go, main_bootstrap_e2e_test.go, main_bootstrap_test.go, multiraft_runtime_test.go
Reorders startup to build the metrics registry before creating the raft factory and passes metricsRegistry.ColdStartObserver() into newRaftFactory; test call sites adjusted to the new signature.
Engine applied-index seeding & advance-only guard
internal/raftengine/etcd/engine.go
Engine seeding now uses the cold-start effective applied index, setApplied only advances strictly forward, and duplicate normal entries are routed to drop-or-replay logic for volatile payloads.
WAL/snapshot restore and skip-gate implementation
internal/raftengine/etcd/wal_store.go
loadWalState/restoreSnapshotState return an effective applied override, decide skip/execute/fallback, verify token/footer/body CRCs for header-only skip, apply header via SnapshotHeaderApplier when supported, and report outcomes via the observer.
Skip-gate and header verification tests
internal/raftengine/etcd/wal_store_skip_gate_test.go
Adds test doubles and tests validating skip/execute/fallback decisioning, WAL post-snapshot rules, header CRC/size verification failures, fallback reasons, and kvFSM header round-trip behavior.
Other test updates
internal/raftengine/etcd/wal_store_test.go, internal/raftengine/etcd/grpc_transport_test.go, internal/raftengine/etcd/encryption_test.go, internal/raftengine/etcd/cold_start_volatile_replay_test.go, monitoring/cold_start_test.go
Updates restoreSnapshotState and applyNormalEntry call signatures, adds volatile-replay tests, and adds observer metric unit tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • bootjp/elastickv#910: Related cold-start skip-gate design and snapshot-header seam implementation.
  • bootjp/elastickv#747: Streaming-token snapshot changes that interact with the updated restoreSnapshotState signature and token CRC handling.
  • bootjp/elastickv#915: Related durable applied-index plumbing and AppliedIndexReader/Writer work that the skip-gate relies upon.

Poem

A rabbit hops through snapshots bright,
Skipping full restores when FSM's just right. 🐰
Headers checked and metrics spun,
Observers count the work we've done. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and specifically describes the main change: implementing a cold-start snapshot-restore skip optimization with CRC verification and metrics.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch impl/snapshot-skip-b3-skip-gate

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (ce277e6):

  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 5, 2026

@claude review

Branch 3 of the PR #910 design — the user-visible cold-start perf win. Follows the merged Branch 2 (PR #915 — metaAppliedIndex plumbing). 6 commits, focused review pathways:

Review observation surface:

  • §4 strictly-additive fallback: decideSkipOutcome collapses every uncertainty (FSM not a reader, meta key missing, read error) to a fallback enum; LastAppliedIndex errors MUST NOT abort cold start; tests TestSkipGate_Fallback* cover all three reasons.
  • §5 two-phase seam (round-7): ParseSnapshotHeader(r io.Reader) (ceiling, cutover uint64, err error) + ApplySnapshotHeader(ceiling, cutover uint64). Engine wraps r in a crc32 TeeReader; FSM Parse reads + drains; engine verifies CRC; engine calls Apply ONLY on verification success. The body bytes are read but discarded (fsm.db already has equivalent state).
  • §5 CRC three-step (round-6): size + footer-vs-tokenCRC + full-body-CRC, byte-for-byte matching openAndRestoreFSMSnapshot's safety contract. Tests TestApplyHeaderStateOnSkip_* inject each failure mode and assert FSM state unchanged.
  • §6 nosync mode: not relevant here (B3 reads only; B2 handled the write side).
  • §9 observability: 3 outcomes + gap label + structured INFO log. raftengine.ColdStartObserver interface in the parent package so monitoring can implement without circular import.
  • Compile-time guards: var _ raftengine.SnapshotHeaderApplier = (*kvFSM)(nil) added next to the existing AppliedIndexReader/Writer guards — a rename or signature drift fails go build rather than silently degrading the skip to full-restore-always.
  • Non-Goals respected: Engine.applySnapshot untouched (peer-after-InstallSnapshot bump remains a B4 / follow-up; the TODO sentinel from PR feat(snapshot-skip B2): plumb metaAppliedIndex through raft-Apply + both snapshot persist sites #915 round-6 still names it).
  • Caller audit (per /loop semantic-change rule): restoreSnapshotState signature changed (added obs + logger). All in-package callers updated; 9 test call sites updated to pass (nil, nil) — tests opt out of metrics.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

Claude finished @bootjp's task in 4m 16s —— View job


PR #934 Review — Branch 3: Cold-Start Snapshot-Restore Skip Gate

Todo list:

  • Read changed files and understand the implementation
  • Review decideSkipOutcome / strictly-additive fallback (§4)
  • Review two-phase seam: ParseSnapshotHeader + ApplySnapshotHeader (§5 round-7)
  • Review CRC three-step verification (§5 round-6)
  • Review observability (ColdStartObserver + metrics) (§9)
  • Review compile-time guards
  • Review test coverage
  • TLA+ spec divergence check (kv/fsm.go anchor)
  • Final summary

Overall assessment

The implementation is correct and the core design invariants are faithfully preserved. The three design constraints given in the trigger — strictly-additive fallback (§4), two-phase seam (§5/round-7), and CRC mirroring (§5/round-6) — are all correctly implemented. Four findings below, in descending severity.


Finding 1 — ParseSnapshotHeader error short-circuits before the full-body CRC is verified (design trade-off to document)

Location: internal/raftengine/etcd/wal_store.go:409-415

ceiling, cutover, err := readSnapshotHeaderOrDrain(setter, hasSetter, tee)
if err != nil {
    return err   // ← returns header error; h.Sum32() is never reached
}

if h.Sum32() != footer {
    return errors.Wrapf(ErrFSMSnapshotFileCRC, ...)
}

If ParseSnapshotHeader returns an error (e.g., ErrSnapshotHeaderUnknownMagic), the full-body CRC step is skipped. This is not a safety bug — the header error propagates, so the cold start correctly fails closed and FSM state is not mutated. However, it means a file that passes the step-2 (footer-vs-token) check but has a corrupt header surfaces as a header format error rather than ErrFSMSnapshotFileCRC.

This differs subtly from openAndRestoreFSMSnapshot where restoreAndComputeCRC runs the full-body CRC before fsm.Restore can even report a parse error. The skip path inverts the order: header parse first (inside the CRC pass), CRC checked after. The design deliberately separates parse from apply, so this ordering is intentional. A one-line comment explaining why early-return on parse error is acceptable here would help future readers avoid "fixing" this to match the full-restore flow.


Finding 2 — have == want boundary case is not tested

Location: internal/raftengine/etcd/wal_store_skip_gate_test.go

TestSkipGate_SkipsWhenFSMFreshEnough uses appliedIdx = 200, snapIndex = 100 (well ahead). TestSkipGate_ExecutesWhenFSMStale uses appliedIdx = 100, snapIndex = 200 (well behind). The default branch in decideSkipOutcome fires at have >= want, so have == want is a skip. This boundary case matters for two reasons:

  1. The observer records gap_ahead = have - snapIndex = 0 — a zero gap is valid, but needs explicit confirmation.
  2. The appliedIndexGap gauge would be set to 0, which Prometheus will emit and operators should expect.

[Fix this →](https://claude.ai/code?q=Add%20TestSkipGate_SkipsAtExactBoundary%20to%20internal%2Frightengine%2Fetcd%2Fwal_store_skip_gate_test.go%20on%20branch%20impl%2Fsnapshot-skip-b3-skip-gate.%20Use%20appliedIdx%20%3D%20snapIndex%20%3D%20100%2C%20verify%20restoreSnapshotState%20returns%20nil%2C%20fsm.bodyBytes%20is%20empty%2C%20fsm.restoredHeader%20is%20true%2C%20and%20obs.skipped%20%3D%20%5B%5D uint64%7B0%7D&repo=bootjp/elastickv)


Finding 3 — RestoreFallback discards snapIndex; the appliedIndexGap metric emits nothing for fallback — confirm intentional

Location: monitoring/cold_start.go:87-93

func (o *ColdStartObserver) RestoreFallback(snapIndex uint64, reason string) {
    _ = snapIndex    // ← explicitly discarded
    if o == nil || o.metrics == nil {
        return
    }
    o.metrics.restoreTotal.WithLabelValues("fallback", reason).Inc()
    // no appliedIndexGap update
}

This is correct per the design ("store could not authoritatively name a value"). But operators watching only elastickv_fsm_cold_start_applied_index_gap may see a missing data point and confuse it with a broken metric rather than an expected fallback. A Prometheus absente() alert on outcome=~"skipped|executed" would be the right operational guidance. Consider adding this to the runbook or alerting notes, not to the code.


Finding 4 — TLA+ spec: no divergence

kv/fsm.go anchor (applyHLCLease / SetPhysicalCeiling):

The new ApplySnapshotHeader calls:

if f.hlc != nil && ceiling > 0 {
    f.hlc.SetPhysicalCeiling(int64(ceiling))
}

This is a cold-start-only path, invoked before the Raft event loop starts. It sets the same ceiling that Restore → ReadSnapshotHeader → SetPhysicalCeiling would have set in the full-restore path. The HLC-4 preconditions are unaffected:

  • (i) bounded skew — ceiling is read from the persisted snapshot, not the local wall clock.
  • (ii) logical-counter handoff on leader change — not applicable at cold start; leader election has not happened yet.
  • (iii) commit-time ceiling fence — the skip path issues no writes.

No OCC, MVCC visibility, or routing-catalog invariants are touched. No spec update required.


Minor notes (no action required)

  • //nolint:mnd on 1<<20 in ParseSnapshotHeader (kv/fsm.go): Correct given the cross-package constant boundary — fsmRestoreReadAhead lives in internal/raftengine/etcd which kv cannot import. The inline literal matches the value. ✅

  • fallback_reason label is present but "" for skipped/executed outcomes — This is idiomatic Prometheus (sparse label rather than absent label) and avoids cardinality explosion. Consistent with conventions used elsewhere in the codebase. ✅

  • Seek-back symmetry: applyHeaderStateOnSkip correctly calls file.Seek(0, io.SeekStart) after verifyFSMSnapshotPrefix leaves the file cursor after the footer, mirroring openAndRestoreFSMSnapshot:294. ✅

  • bootstrapNewCluster / persistBootState pass nil observer — correct; the bootstrap snapshot restore is a no-op for fresh clusters, and the comment in persistBootState explicitly explains why no observer is needed. ✅

  • monitoring/hlc_test.go alignment-only diff (column alignment) — no logic change. ✅


Summary

Core correctness is solid. Finding 1 (header-error vs CRC-error ordering) is a design consequence worth documenting inline. Finding 2 (boundary test at have == want) is the only gap in coverage. Finding 3 is an operational note. No spec divergence.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a cold-start snapshot-restore skip-gate optimization. It allows skipping the restoration of a multi-GiB snapshot body if the on-disk state machine (FSM) is already as fresh as or fresher than the snapshot. The changes include the definition of the ColdStartObserver and SnapshotHeaderApplier interfaces, integration into the WAL loading and FSM restore paths, implementation of Prometheus metrics for monitoring skip-gate outcomes, and comprehensive unit tests. There are no review comments, so I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce277e6667

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// nil disables metrics; the skip itself still runs. See
// docs/design/2026_06_02_idempotent_snapshot_restore.md §9 and
// internal/raftengine/cold_start.go for the contract.
ColdStartObserver raftengine.ColdStartObserver
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Plumb the cold-start observer through the factory

In the production startup path this observer is never populated: main.go creates monitoring.NewRegistry(...), but engines are opened through internal/raftengine/etcd.Factory.Create, and that method builds OpenConfig without a ColdStartObserver field because raftengine.FactoryConfig has no way to carry it. As a result loadWalState always receives obs == nil in normal server runs, so the new cold-start counters/gauges remain empty even when skips/executions/fallbacks happen. Please thread Registry.ColdStartObserver() through the factory config/open call.

Useful? React with 👍 / 👎.

PR #934 round-1 codex P2: in production, monitoring.NewRegistry's
ColdStartObserver was never reaching the engine because
internal/raftengine/etcd.Factory.Create built OpenConfig solely from the
engine-agnostic raftengine.FactoryConfig (which has no place to carry the
observer). The Branch 3 cold-start counters/gauges therefore stayed at
zero in real runs even when skips/executions/fallbacks happened.

Thread it via the engine-specific etcd.FactoryConfig (already holds
TickInterval/HeartbeatTick/etc.) so the change is local to the etcd
package and main.go — buildShardGroups, buildRuntimeForGroup, and the
shared raftengine.FactoryConfig stay untouched.

Wiring:
- etcd.FactoryConfig gains ColdStartObserver raftengine.ColdStartObserver
- Factory.Create copies it into OpenConfig{}
- newRaftFactory in main.go now takes the observer arg and passes
  metricsRegistry.ColdStartObserver() through
- moved the factory construction below metricsRegistry creation so the
  observer is available at factory build time

Tests:
- 3 existing newRaftFactory callers updated to pass nil (no metrics
  needed in those test scopes)
- TestSkipGate_*/TestApplyHeaderStateOnSkip_*/TestSkipGate_KVFSMHeaderRoundTrip
  (9 tests) still pass — they exercise restoreSnapshotState directly with
  a recording observer test double, unaffected by the factory plumbing

Verified: go vet ./..., golangci-lint run ./internal/raftengine/etcd/...
./monitoring/... ./internal/raftengine/... ./., and
go test -count=1 ./internal/raftengine/etcd/ all clean.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 5, 2026

@claude review

Round-1 codex P2 addressed (commit 320d06e):

monitoring.Registry.ColdStartObserver() now flows through to OpenConfig.ColdStartObserver in production:

  • etcd.FactoryConfig gained ColdStartObserver raftengine.ColdStartObserver
  • Factory.Create copies it into the OpenConfig{} literal at factory.go:68
  • main.go reorders so metricsRegistry is built before newRaftFactory, then passes metricsRegistry.ColdStartObserver() in
  • Three newRaftFactory callers in tests pass nil (those scopes do not exercise metrics)

Kept the change inside the engine-specific etcd.FactoryConfig rather than the shared raftengine.FactoryConfig to avoid touching buildShardGroups, buildRuntimeForGroup, and unrelated test wiring.

Verified: go vet ./..., golangci-lint run (0 issues across raftengine/etcd, monitoring, raftengine, root), go test ./internal/raftengine/etcd/ green, all 9 TestSkipGate_* / TestApplyHeaderStateOnSkip_* still pass.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (320d06e):

  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

Claude finished @bootjp's task in 2m 33s —— View job


PR #934 Round-2 Review — ColdStartObserver factory wiring (commit 320d06e)

Todo list:

  • Load tools
  • Read changed files (factory.go, main.go, engine.go, cold_start.go, registry.go)
  • Verify ColdStartObserver flows correctly end-to-end in production path
  • Check test callers pass nil cleanly
  • Re-verify Round-1 findings status (boundary test, header-error comment)
  • Post final review

Codex P2 — Closed correctly ✓

The full wiring chain is intact:

Step Location Verified
NewRegistry builds coldStartObs monitoring/registry.go:57
Registry.ColdStartObserver() returns it monitoring/registry.go:65-70
metricsRegistry constructed before factory call main.go:323 vs main.go:331
newRaftFactory takes coldStartObs raftengine.ColdStartObserver main.go:53
FactoryConfig.ColdStartObserver carries it factory.go:26
Factory.Create copies into OpenConfig{ColdStartObserver: ...} factory.go:68
Open passes cfg.ColdStartObserver to loadWalState wal_store.go:43
loadWalStaterestoreSnapshotStatereportColdStart wal_store.go:135, 264
reportColdStart guards every dispatch with obs != nil wal_store.go:329,340,354

Test callers: main_bootstrap_test.go:94, main_bootstrap_e2e_test.go:354, multiraft_runtime_test.go:54,109 all pass nil correctly — tests don't exercise metrics.

Bootstrap recursive path (wal_store.go:527): passes nil with an explanatory comment — this is the fresh-bootstrap path that runs before OpenConfig wiring, and a fresh-bootstrap restore is a no-op anyway. Correct.

Nil-interface safety: reportColdStart guards with obs != nil (interface-level check), which is the right check for an interface type. *ColdStartObserver methods additionally check o == nil for the concrete receiver, providing defence-in-depth if the observer is ever passed as a non-nil interface wrapping a nil pointer. ✓


One minor note (no action required)

factory.go:24 and main.go:325-330 both reference "PR #934 round-1 codex P2". Per the codebase's comment convention, PR references in inline comments rot as the codebase evolves; the stable reference is docs/design/2026_06_02_idempotent_snapshot_restore.md §9. The information is accurate now but won't age well. Suggest replacing with the design-doc reference on cleanup, though it's not blocking.


Round-1 findings status

Finding Status
F1 — document header-error-before-CRC ordering in applyHeaderStateOnSkip Open (minor, documentation only)
F2 — add TestSkipGate_SkipsAtExactBoundary (have == want, gap=0) Open (test completeness)
F3 — operational note on appliedIndexGap absence for fallback Acknowledged, no code change needed
F4 — TLA+ spec: no divergence Confirmed
Codex P2 — ColdStartObserver not plumbed in production path Closed by 320d06e

Overall

P2 is closed correctly. The change is minimal and surgical — only FactoryConfig, Factory.Create, and the main initialisation order needed touching, which is exactly the right scope. No new issues introduced. The two remaining open findings (F1 comment, F2 boundary test) are cosmetic/test-completeness items that don't affect correctness or safety.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 320d06efb8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/raftengine/etcd/wal_store.go Outdated
Comment on lines +264 to +267
decision, have := decideSkipOutcome(fsm, tok.Index)
reportColdStart(obs, logger, decision, tok.Index, have)
if decision == coldStartSkip {
return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Prevent skipping when WAL will replay newer entries

When the FSM has applied entries after the latest persisted snapshot (the normal interval between snapshots, since metaAppliedIndex advances on each Apply), LastAppliedIndex() > tok.Index takes this skip path even though Open still initializes the raft applied floor from LocalSnap.Metadata.Index and openAndReadWAL reloads the post-snapshot entries. Those entries are then replayed into a Pebble store that already contains them; transaction replays can hit OCC conflicts before kvFSM observes their commit timestamps, leaving the restarted node’s HLC below timestamps already present on disk. The skip needs to advance/suppress replay through have, or only skip when there are no post-snapshot WAL entries to replay.

Useful? React with 👍 / 👎.

… (codex P1 #934)

When the FSM has applied entries past the latest persisted snapshot
(the normal interval between snapshots — metaAppliedIndex advances on
each Apply), the prior skip path fired on LastAppliedIndex() >
tok.Index even though Open still initializes raft from
LocalSnap.Metadata.Index and openAndReadWAL replays the post-snapshot
WAL entries. Those entries (tok.Index+1 .. have) would re-apply onto
a Pebble store that already contains them, hitting OCC conflicts in
the transaction commit path before kvFSM observes the commit
timestamps, and leaving the HLC ceiling below timestamps already on
disk.

Fix: reorder loadWalState so the WAL is opened BEFORE the skip-gate
decision. Compute lastWalIndex (max entry Index in the WAL tail, or
snapshot.Metadata.Index when empty) and gate skip on
'have >= lastWalIndex' instead of 'have >= tok.Index'. When the WAL
still carries entries beyond have, the gate falls through to the
full-restore path, which then runs through the normal apply path
without double-applies.

Signature: restoreSnapshotState gains a lastWalIndex uint64
parameter. reportColdStart now takes (snapIndex, target, have) so
the structured-log fields capture both 'snapshot_index' (original
tok.Index for operator context) and 'last_wal_index' (the
threshold actually used), and gap_ahead / gap_behind are
computed against the threshold.

Caller audit (signature change):
- restoreSnapshotState production caller: loadWalState only. Updated.
- Test call sites in wal_store_test.go (8), wal_store_skip_gate_test.go
  (5), grpc_transport_test.go (1) all updated to pass
  snap.Metadata.Index as the third argument — preserves the original
  test semantics where the threshold equals the snapshot index.
- reportColdStart is private to this file; single call site updated.

Regression: TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries -
snap.Index=100, fsm.applied=150, lastWalIndex=200; the skip path
would previously fire (have>snapIndex) but must now execute because
have<lastWalIndex. Verifies fsm.bodyBytes (full restore ran),
observer.executed=[lastWalIndex-have], and no skip recorded.

Observer interface unchanged — RestoreSkipped/RestoreExecuted still
take (snapIndex, have), but the value passed for 'snapIndex' is now
the threshold (lastWalIndex). This keeps the metric semantically
meaningful as 'gap relative to where the WAL replay would advance to'.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (57467ed):

  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 7, 2026

@claude review

Addresses codex P1 #934 (internal/raftengine/etcd/wal_store.go) — cold-start skip gate would re-apply post-snapshot WAL entries.

Bug: when FSM applied > tok.Index (normal interval between snapshots — metaAppliedIndex advances each Apply), the skip path fired on have >= tok.Index. But Open still initializes raft's applied floor from LocalSnap.Metadata.Index and openAndReadWAL replays the post-snapshot WAL entries. Those entries (tok.Index+1 .. have) re-applied onto a Pebble store that already contained them → OCC conflicts in transaction commit + HLC ceiling left below timestamps already on disk.

Fix: reorder loadWalState so the WAL is opened BEFORE the skip-gate. Compute lastWalIndex (max entry Index in the WAL tail, or snapshot.Metadata.Index when empty) and gate skip on have >= lastWalIndex. When the WAL still carries entries beyond have, the gate falls through to full restore which runs through the normal apply path without double-applies.

Caller audit: restoreSnapshotState gains lastWalIndex uint64 param. Single production call site (loadWalState) + 14 test call sites updated. reportColdStart extended to log both snapshot_index (operator context) and last_wal_index (actual threshold).

Regression: TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries — snap=100, applied=150, lastWalIndex=200; would previously skip (have>snapIndex) but must execute because have<lastWalIndex.

go test + golangci-lint clean on ./internal/raftengine/etcd/.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 7, 2026

Claude finished @bootjp's task in 9m 47s —— View job


PR #934 Round-3 Review — Gate cold-start skip on lastWalIndex (codex P1, commit 57467ed)

Todo list:

  • Load tools
  • Read wal_store.go (skip gate + loadWalState reorder)
  • Read wal_store_skip_gate_test.go (new regression test)
  • Verify fix correctness (WAL open-before-skip, lastWalIndex computation)
  • Check edge cases (empty WAL, exact boundary, WAL == snapIndex)
  • Verify reportColdStart extension (snapshot_index + last_wal_index logging)
  • TLA+ spec: kv/fsm.go anchor — no new HLC/OCC divergence
  • Post review

Overall assessment

The fix is correct. The gate change from have >= tok.Index to have >= lastWalIndex closes a genuine invariant violation. The underlying assumption is sound: kvFSM.Apply is fully idempotent — transaction commits via commitApplyStartTS + applyCommitWithIdempotencyFallback (kv/fsm.go:956,994), and raw writes via the strict ts > startTS conflict check (store/lsm_store.go:1305) mean re-delivering WAL entries tok.Index+1..lastWalIndex after a skip causes no OCC conflicts. Three findings below.


Finding 1 — ColdStartObserver interface docstring stale after semantic change (Minor)

Locations: internal/raftengine/cold_start.go:12-24, monitoring/cold_start.go:37-41,63-74

The interface was added when the skip gate compared against tok.Index. Now RestoreSkipped / RestoreExecuted receive target = lastWalIndex, not snapshot.Metadata.Index. Three places still describe the gap as relative to the snapshot pointer:

internal/raftengine/cold_start.go (parameter name + doc):

// RestoreSkipped: `gap = haveAppliedIndex -
//     snapshot.Metadata.Index` (how far ahead the live store was).
RestoreSkipped(snapIndex, haveAppliedIndex uint64)  // ← "snapIndex" is misleading; value is lastWalIndex

monitoring/cold_start.go (Prometheus Help string, line 37):

Help: "…outcome=skipped means the gate fired (LastAppliedIndex >= snapshot.Index)…"
// Should be: LastAppliedIndex >= lastWalIndex (WAL tail)

monitoring/cold_start.go (method comments, lines 63-74):

// RestoreSkipped records a successful skip. gap = have - snapIndex
// RestoreExecuted …gap = snapIndex - have
// Both are correct in arithmetic, but "snapIndex" is now the WAL tail, not the snapshot pointer.

Operators reading the Prometheus metric description will infer "distance from snapshot" when the true meaning is "distance from WAL tail." When lastWalIndex >> tok.Index, the gap is materially smaller than what the description implies.

Fix this →


Finding 2 — Regression test covers execute path but not the skip path with non-empty WAL (Minor)

Location: internal/raftengine/etcd/wal_store_skip_gate_test.go

The new TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries (snap=100, applied=150, lastWalIndex=200) correctly pins the bug scenario. However the skip path is still only tested with an empty WAL tail (lastWalIndex = snap.Metadata.Index in TestSkipGate_SkipsWhenFSMFreshEnough).

The common production path — WAL has entries AND the skip fires — is untested:

snap.Index=100, applied=200, lastWalIndex=200   ← WAL entries 101..200 exist; have≥lastWalIndex → skip

This should verify: applyHeaderStateOnSkip ran, fsm.bodyBytes is empty, obs.skipped = [200-200=0]. The gap-zero case also exercises have-target=0 in reportColdStart (the underflow-safe path).

(This is the same test requested in Round-1 F2, now with a clearer motivation since the skip path with non-empty WAL is the primary production benefit of the fix.)

Fix this →


Finding 3 — Correct but worth noting: empty-WAL and non-empty-WAL skip gates are now semantically unified

loadWalState:150-153lastWalIndex initialises to snapshot.Metadata.Index and is overridden only when WAL entries are present:

lastWalIndex := snapshot.Metadata.Index
if n := len(entries); n > 0 && entries[n-1].Index > lastWalIndex {
    lastWalIndex = entries[n-1].Index
}

When entries are empty, lastWalIndex = tok.Index, which means restoreSnapshotState(…, tok.Index, …) — identical to the pre-P1 semantics for the zero-WAL case. ✓ Existing test semantics preserved.


WAL close on error path — clean ✓

The new error path at lines 154-162 correctly closes the WAL handle when restoreSnapshotState fails. This matches the existing pattern used after newMemoryStorage failure at line 170-176. No resource leak. ✓


TLA+ spec — no divergence ✓

kv/fsm.go anchor: the changes in this commit are confined to loadWalState / restoreSnapshotState / decideSkipOutcome / reportColdStart in wal_store.go. No HLC issuance path, OCC commit-ts ordering, MVCC visibility, or routing-catalog logic is touched. ApplySnapshotHeader and ParseSnapshotHeader are unchanged. HLC-4 preconditions (i)(ii)(iii) remain intact. No spec update required.


Round summary

Finding Severity Status
F1ColdStartObserver docstrings + Prometheus Help stale (snapIndex → WAL tail) Minor Open
F2 — skip path with non-empty WAL untested (have=lastWalIndex>tok.Index) Minor Open
F3 — WAL close on error path n/a ✓ Clean
TLA+ spec divergence n/a ✓ None
Codex P1 core fix ✓ Correct

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 57467ed085

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/raftengine/etcd/wal_store.go Outdated
Comment on lines +290 to +293
decision, have := decideSkipOutcome(fsm, lastWalIndex)
reportColdStart(obs, logger, decision, tok.Index, lastWalIndex, have)
if decision == coldStartSkip {
return applyHeaderStateOnSkip(fsm, fsmSnapPath(fsmSnapDir, tok.Index), tok.CRC32C)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Advance Raft's applied floor when skipping the restore

When a persisted snapshot is at index S and the FSM has already applied through the WAL tail T, this new condition selects the skip path, but Open still initializes Engine.applied and the RawNode applied floor from LocalSnap.Metadata.Index (S). Consequently Raft emits the committed entries S+1..T again and applyCommitted replays them against the already-advanced Pebble state, which can produce OCC failures or repeat other non-idempotent FSM effects. The new have >= lastWalIndex gate is fresh evidence beyond the earlier review: it guarantees that every post-snapshot WAL entry is already represented in the FSM without adding any corresponding mechanism to suppress its replay.

Useful? React with 👍 / 👎.

Comment thread internal/raftengine/etcd/wal_store.go Outdated
Comment on lines +354 to +356
case coldStartSkip:
if obs != nil {
obs.RestoreSkipped(target, have)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Report snapshot-relative gaps to the observer

Pass snapIndex here rather than target: ColdStartObserver.RestoreSkipped defines its first argument as the snapshot index, and the Prometheus implementation calculates have - snapIndex, but this call supplies lastWalIndex. Whenever the snapshot trails the WAL, the exported gauge therefore reports distance from the WAL tail (often zero) while its help text claims distance from the snapshot pointer; the executed callback below has the same argument mismatch.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
internal/raftengine/etcd/wal_store_skip_gate_test.go (3)

85-95: ⚡ Quick win

Misleading parameter names in test double.

The recordingObs methods name the first parameter snapIndex, but it receives target (lastWalIndex) from the actual implementation (see reportColdStart in wal_store.go:354-386). This is confusing because target and snapIndex can differ—when the WAL carries post-snapshot entries, target > snapIndex. The gap calculations happen to work because the tests pass snap.Metadata.Index as lastWalIndex, making them equal, but the misleading names make the test harder to maintain.

♻️ Rename parameters to match the interface contract
-func (o *recordingObs) RestoreSkipped(snapIndex, have uint64) {
+func (o *recordingObs) RestoreSkipped(target, have uint64) {
 	o.mu.Lock()
 	defer o.mu.Unlock()
-	o.skipped = append(o.skipped, have-snapIndex)
+	o.skipped = append(o.skipped, have-target)
 }
 
-func (o *recordingObs) RestoreExecuted(snapIndex, have uint64) {
+func (o *recordingObs) RestoreExecuted(target, have uint64) {
 	o.mu.Lock()
 	defer o.mu.Unlock()
-	o.executed = append(o.executed, snapIndex-have)
+	o.executed = append(o.executed, target-have)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/wal_store_skip_gate_test.go` around lines 85 - 95,
Rename the parameters in the recordingObs test double so they reflect the
interface contract: change the first parameter name from snapIndex to target (or
lastWalIndex) in both RestoreSkipped and RestoreExecuted to match how
reportColdStart in wal_store.go calls them; update any internal calculations or
variable names in RestoreSkipped and RestoreExecuted to use target and snapIndex
appropriately (e.g., compute gaps as have - target or target - have only where
semantically correct) so the test double’s signature and names align with the
real invocation and avoid confusion.

117-119: 💤 Low value

Extract repeated header length constant.

The header length 16 appears in multiple tests (here and line 327) and in the test double (line 59). Consider extracting it as a package constant for maintainability.

♻️ Extract constant

Add near the top of the file:

const (
	testV1HeaderLen = 16 // EKVTHLC1 (8 bytes) + ceiling BE uint64 (8 bytes)
)

Then replace the literals:

-	payload := make([]byte, 16)
+	payload := make([]byte, testV1HeaderLen)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/wal_store_skip_gate_test.go` around lines 117 - 119,
Extract the repeated header length literal by adding a package-level constant
(e.g., testV1HeaderLen = 16) near the top of
internal/raftengine/etcd/wal_store_skip_gate_test.go and replace all literal
uses (for example the make([]byte, 16) in the payload construction, any
copy/slice ops assuming 16 bytes, and the test double referenced around the
earlier helper) with that constant; update references such as the payload
creation (payload := make([]byte, 16)), any index/slice bounds that assume the
16-byte header, and the test double setup to use testV1HeaderLen so the header
size is maintained in one place.

293-315: 💤 Low value

Consider adding clarifying comment for intentional corruption.

The bit flip at line 307 is intentional test corruption, which is clear from context, but a brief comment would make the intent immediately obvious to future readers.

📝 Add clarifying comment
 	var b [1]byte
 	_, err = f.ReadAt(b[:], 0)
 	require.NoError(t, err)
+	// Flip a bit to simulate corruption
 	b[0] ^= 0x01
 	_, err = f.WriteAt(b[:], 0)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/wal_store_skip_gate_test.go` around lines 293 - 315,
Add a short clarifying comment immediately above the intentional bit-flip (the
line with b[0] ^= 0x01) in TestApplyHeaderStateOnSkip_BodyCorruption to state
that the mutation is deliberate test corruption to ensure applyHeaderStateOnSkip
detects CRC mismatch; reference the variables crc, path and the expectation
ErrFSMSnapshotFileCRC and note that FSM state must not change
(fsm.restoredHeader) so reviewers understand the intent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/raftengine/etcd/wal_store.go`:
- Around line 352-369: In reportColdStart, the ColdStartObserver callbacks are
called with the wrong first argument (passing target/lastWalIndex); change the
calls in reportColdStart to pass snapIndex as the first parameter so
RestoreSkipped and RestoreExecuted are invoked as (snapshotIndex,
haveAppliedIndex) — update the obs.RestoreSkipped(...) and
obs.RestoreExecuted(...) invocations in the reportColdStart function
accordingly.
- Around line 290-295: The current code calls reportColdStart before performing
the actual restore/skip, which records metrics even if applyHeaderStateOnSkip or
openAndRestoreFSMSnapshot fail; move the reportColdStart call so it runs only
after the chosen path successfully completes: call decideSkipOutcome(fsm,
lastWalIndex) to get decision/have, then perform either
applyHeaderStateOnSkip(fsm, fsmSnapPath(...), tok.CRC32C) or
openAndRestoreFSMSnapshot(...), check their returned error is nil, and only then
invoke reportColdStart(obs, logger, decision, tok.Index, lastWalIndex, have);
ensure both branches (skip and restore) return/report only on success and
propagate errors unchanged.

---

Nitpick comments:
In `@internal/raftengine/etcd/wal_store_skip_gate_test.go`:
- Around line 85-95: Rename the parameters in the recordingObs test double so
they reflect the interface contract: change the first parameter name from
snapIndex to target (or lastWalIndex) in both RestoreSkipped and RestoreExecuted
to match how reportColdStart in wal_store.go calls them; update any internal
calculations or variable names in RestoreSkipped and RestoreExecuted to use
target and snapIndex appropriately (e.g., compute gaps as have - target or
target - have only where semantically correct) so the test double’s signature
and names align with the real invocation and avoid confusion.
- Around line 117-119: Extract the repeated header length literal by adding a
package-level constant (e.g., testV1HeaderLen = 16) near the top of
internal/raftengine/etcd/wal_store_skip_gate_test.go and replace all literal
uses (for example the make([]byte, 16) in the payload construction, any
copy/slice ops assuming 16 bytes, and the test double referenced around the
earlier helper) with that constant; update references such as the payload
creation (payload := make([]byte, 16)), any index/slice bounds that assume the
16-byte header, and the test double setup to use testV1HeaderLen so the header
size is maintained in one place.
- Around line 293-315: Add a short clarifying comment immediately above the
intentional bit-flip (the line with b[0] ^= 0x01) in
TestApplyHeaderStateOnSkip_BodyCorruption to state that the mutation is
deliberate test corruption to ensure applyHeaderStateOnSkip detects CRC
mismatch; reference the variables crc, path and the expectation
ErrFSMSnapshotFileCRC and note that FSM state must not change
(fsm.restoredHeader) so reviewers understand the intent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ddfbe851-5668-47b2-a0e9-bfcd9c549572

📥 Commits

Reviewing files that changed from the base of the PR and between d23cf69 and 57467ed.

📒 Files selected for processing (16)
  • internal/raftengine/cold_start.go
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/factory.go
  • internal/raftengine/etcd/grpc_transport_test.go
  • internal/raftengine/etcd/wal_store.go
  • internal/raftengine/etcd/wal_store_skip_gate_test.go
  • internal/raftengine/etcd/wal_store_test.go
  • internal/raftengine/statemachine.go
  • kv/fsm.go
  • kv/fsm_applied_index_iface_check.go
  • main.go
  • main_bootstrap_e2e_test.go
  • main_bootstrap_test.go
  • monitoring/cold_start.go
  • monitoring/registry.go
  • multiraft_runtime_test.go

Comment thread internal/raftengine/etcd/wal_store.go Outdated
Comment thread internal/raftengine/etcd/wal_store.go Outdated
…bit Major #934)

Four findings on review 4434778631 + the coderabbit review on the
same head. Round-2 fix for the codex P1 plus the three Major / P2
items collected:

1. Codex P1: skip path didn't advance Raft's applied floor. Even
   with the lastWalIndex gate (which guarantees the FSM has applied
   through the WAL tail), Engine.Open still seeded e.applied from
   maxAppliedIndex(LocalSnap) = snapshot.Metadata.Index. Raft then
   delivered entries snap.Index+1..have to applyCommitted and
   replayed them onto a Pebble store already containing them - OCC
   conflicts in the kvFSM commit path; HLC ceiling inversion.

   Fix: restoreSnapshotState now returns (effectiveApplied uint64,
   error). diskState carries EffectiveApplied. Engine.Open seeds
   e.applied and engine.appliedIndex via coldStartApplied(disk) =
   max(maxAppliedIndex(LocalSnap), EffectiveApplied).

   Belt+braces: applyNormalCommitted skips entries with
   Index <= e.applied (idempotent no-op); setApplied is now
   advance-only so conf changes between snap.Index+1 and have do
   not regress the counter.

2. Codex P2 / coderabbit Major (observer args): the ColdStartObserver
   contract is (snapshotIndex, haveAppliedIndex) - monitoring/cold_start.go
   computes have - snapIndex for the gauge. Round-1 passed
   target=lastWalIndex instead, breaking the metric baseline. Revert
   to passing snapIndex. The 'target' / lastWalIndex is preserved as
   a structured-log field for operator context.

3. Coderabbit Major (emit-after-success): reportColdStart used to
   fire BEFORE applyHeaderStateOnSkip / openAndRestoreFSMSnapshot,
   so a CRC/header failure would still register a 'successful'
   outcome. Moved to after the restore-side work succeeds.

Caller audit (signature + semantic):
- restoreSnapshotState: now returns (uint64, error). Single
  production caller (loadWalState) plumbs into diskState.EffectiveApplied.
  14 test call sites updated.
- setApplied: advance-only. Callers are all in the apply loop
  (applyNormalCommitted, applyConfChangeCommitted, applyConfChangeV2Committed,
  applyCommitted's default case). The new gate prevents regression
  from conf changes between snap.Index+1 and have, which is exactly
  the case the codex P1 fix relies on.
- coldStartApplied: new helper, called from Engine.Open only.

Extracted restoreSnapshotStateFromToken to keep restoreSnapshotState
under the nestif limit.

New regression tests in wal_store_skip_gate_test.go:
- TestSkipGate_ReturnsEffectiveAppliedOnSkip: skip path returns
  have, not snap.Metadata.Index.
- TestSkipGate_EmitsAfterSuccess: CRC mismatch does NOT register
  obs.skipped / obs.executed (metrics emit only on success).

go test + golangci-lint clean on ./internal/raftengine/etcd/.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (c9b7c3b):

  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 7, 2026

@claude review

Addresses all 4 findings on review 4434778631 + the coderabbit review on the same head:

  1. Codex P1 (wal_store.go:293) — advance Raft's applied floor on skip. Round-1 fix gated skip on have >= lastWalIndex but Engine.Open still seeded e.applied from maxAppliedIndex(LocalSnap) = snap.Index. Raft would still deliver entries snap.Index+1..have to applyCommitted and replay them onto a Pebble store already containing them.

    Fix: restoreSnapshotState now returns (effectiveApplied uint64, error). diskState carries EffectiveApplied. Engine.Open uses coldStartApplied(disk) = max(maxAppliedIndex(LocalSnap), EffectiveApplied) to seed e.applied and the atomic mirror. Belt+braces: applyNormalCommitted skips entries with Index <= e.applied (idempotent no-op); setApplied is now advance-only so conf changes in [snap.Index+1, have] don't regress the counter.

  2. Codex P2 / coderabbit Major (observer args). ColdStartObserver contract is (snapshotIndex, haveAppliedIndex) — Prometheus computes have - snapIndex. Round-1 passed target=lastWalIndex instead. Reverted; lastWalIndex stays as a structured-log field.

  3. Coderabbit Major (emit-after-success). reportColdStart moved to after applyHeaderStateOnSkip / openAndRestoreFSMSnapshot succeeds — CRC/header failures no longer register a 'successful' outcome.

Caller audit: restoreSnapshotState signature change (14 test sites updated). setApplied advance-only audited against all 4 apply paths. coldStartApplied new helper, single caller.

Regression: TestSkipGate_ReturnsEffectiveAppliedOnSkip (skip returns have, not snap.Index) + TestSkipGate_EmitsAfterSuccess (CRC failure does not record skipped/executed metrics).

go test + golangci-lint clean on ./internal/raftengine/etcd/.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

…e (codex P2 #934 round 3)

Two findings on review 4434778631's c9b7c3b head:

1. Codex P2 (wal_store.go:160) — gate on the committed index. The
   round-2 fix used entries[n-1].Index (the WAL tail) as the skip
   threshold, but a follower's WAL can carry an UNCOMMITTED suffix
   that raft will not surface via Ready.CommittedEntries until the
   leader confirms it. If the FSM is applied through the committed
   index but trails that suffix, the gate forced a multi-GiB
   restore on every restart of any such follower, defeating the
   cold-start optimization.

   Fix: extract coldStartSkipThreshold(snapshot, hardState) =
   max(snapshot.Metadata.Index, hardState.Commit). Drops the
   walTail term — raft's invariant guarantees Commit ≤ walTail, so
   the walTail comparison was redundant for safety and harmful for
   the follower-with-uncommitted-suffix optimization.

2. Codex P2 (monitoring/cold_start.go:80) — avoid underflowing the
   executed gauge. After the round-1 gate change, an executed
   restore no longer implies have < snapIndex: a crashed follower
   can have FSM ahead of the snapshot pointer but behind the
   committed tail. The previous snapIndex - have subtraction
   underflowed into ~2^64, corrupting the gauge.

   Fix: ColdStartObserver.RestoreExecuted computes the absolute
   distance |snapIndex - have|. Comment updated to explain the new
   semantics (executed no longer implies stale FSM in the literal
   sense; it means full restore ran because gate did not skip).

Caller audit:
- coldStartSkipThreshold: pure helper, called once from
  loadWalState. No production-side downstream calls.
- ColdStartObserver.RestoreExecuted: signature unchanged; behavior
  change is just the absolute-value computation. Callers
  (engine cold-start path) pass the same arguments.

Regression tests:
- TestColdStartSkipThreshold (wal_store_skip_gate_test.go): 4-case
  table — empty WAL, hardState below snapshot (pre-replay),
  hardState above snapshot, hardState zero.
- TestColdStartObserver_ExecutedAbsoluteGap (new monitoring/cold_start_test.go):
  3-case table — FSM behind / FSM ahead / FSM equal.
- TestColdStartObserver_SkippedGap: skipped-path baseline.
- TestColdStartObserver_NilSafe: nil observer is a no-op.
- TestColdStartObserver_FallbackReasonLabel: 3-reason
  table covering not_reader / missing_meta / read_err.

go test + golangci-lint clean on ./monitoring/ + ./internal/raftengine/etcd/.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (1b166a5):

  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 7, 2026

@claude review

Addresses codex P2 #934 round 3 (review 4434778631 → 2 new inline findings):

  1. wal_store.go:160 — gate on the committed index. Round-2 used entries[n-1].Index (WAL tail) as the skip threshold. A follower's WAL can carry an UNCOMMITTED suffix that raft will not deliver via Ready.CommittedEntries until the leader confirms it. Followers with such suffixes were forced into a full restore on every restart.

    Fix: new coldStartSkipThreshold(snapshot, hardState) = max(snapshot.Metadata.Index, hardState.Commit). Raft's invariant guarantees Commit ≤ walTail, so the walTail comparison was redundant for safety and harmful for the follower optimization.

  2. monitoring/cold_start.go:80 — avoid underflowing the executed gauge. Round-1's gate change means executed restores no longer imply have < snapIndex (crashed follower can have FSM ahead of snapshot but behind committed tail). The snapIndex - have subtraction underflowed into ~2^64.

    Fix: RestoreExecuted computes |snapIndex - have| (absolute distance). Comment updated.

Regression tests:

  • TestColdStartSkipThreshold — 4-case table for helper.
  • TestColdStartObserver_ExecutedAbsoluteGap — 3-case table covering FSM behind / FSM ahead / FSM equal.
  • Plus TestColdStartObserver_SkippedGap, _NilSafe, _FallbackReasonLabel for monitoring coverage.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 7, 2026

Claude finished @bootjp's task in 3m 2s —— View job


PR #934 Round-5 Review — Committed-index gate + executed-gauge underflow fix (commit 1b166a5)

Todo list:

  • Load tools
  • Read wal_store.gocoldStartSkipThreshold helper + usage
  • Read monitoring/cold_start.go — absolute gap fix in RestoreExecuted
  • Read monitoring/cold_start_test.go — new monitoring tests
  • Read wal_store_skip_gate_test.goTestColdStartSkipThreshold regression test
  • Verify correctness of committed-index gate (codex P2 Configure Renovate #1)
  • Verify underflow fix in RestoreExecuted (codex P2 Update dependency go to v1.21.4 #2)
  • Check edge cases (Commit=0, Commit < snapIndex, uncommitted suffix)
  • TLA+ spec: kv/fsm.go anchor — no new HLC/OCC divergence
  • Post final review

Overall assessment

Both fixes are correct. The coldStartSkipThreshold helper picks the right comparison basis (max(snapshot.Metadata.Index, hardState.Commit)) and the RestoreExecuted absolute-gap fix closes a genuine uint64 underflow. Three findings below.


Finding 1 — recordingObs.RestoreExecuted uses snapIndex - have (no abs), while the production impl now uses |snapIndex - have| — test assertions are technically undefined behaviour

Location: internal/raftengine/etcd/wal_store_skip_gate_test.go:91-95, 165

// test double (line 91-95)
func (o *recordingObs) RestoreExecuted(snapIndex, have uint64) {
    o.executed = append(o.executed, snapIndex-have)  // can underflow
}

// assertion (line 165): snapIndex=100, appliedIdx=150
require.Equal(t, []uint64{snapIndex - appliedIdx}, obs.executed)
// both sides compute 100-150 = 18446744073709551466 (wraps equally → test passes)

The test passes only because both sides compute the same wrapped value, not because the semantically correct gap of 50 was verified. The monitoring.ColdStartObserver.RestoreExecuted was fixed to use |snapIndex - have|, but recordingObs was not updated. A future test adding assert.Equal(t, uint64(50), obs.executed[0]) for the FSM-ahead-of-snapshot scenario would fail unexpectedly.

Fix this →


Finding 2 — gap_ahead log field uses have - target (committed tail) while the metric uses have - snapIndex — operators see different baselines

Location: internal/raftengine/etcd/wal_store.go:444

case coldStartSkip:
    if obs != nil {
        obs.RestoreSkipped(snapIndex, have)  // metric gap = have - snapIndex
    }
    if logger != nil {
        logger.Info("restoreSnapshotState skipped",
            zap.Uint64("gap_ahead", have-target),  // log gap = have - hardState.Commit

When hardState.Commit > snapshot.Metadata.Index (post-replay nodes), these two values differ. An operator correlating the log against the Prometheus metric will see different magnitudes. The metric is per-design (snapshot-relative gap); the log would be more consistent if it also logged have-snapIndex as gap_ahead and kept target as a separate last_committed_index field.

This is a minor observability concern, not a correctness issue. Consider either aligning the log to have-snapIndex, or explicitly naming the fields to clarify both:

zap.Uint64("gap_ahead_snapshot",   have-snapIndex),   // what the metric measures
zap.Uint64("gap_ahead_committed",  have-target),      // how far past the WAL committed tail

Finding 3 — coldStartSkipThreshold Commit < snapIndex case: defensive but realistic in one crash window (no action required)

Location: internal/raftengine/etcd/wal_store.go:211-216 + test case "hardState.Commit below snapshot (pre-replay)"

The test covers hardState.Commit = 50, snapshot.Metadata.Index = 100 → threshold = 100. In practice raft guarantees Commit >= snapshot.Index for any persisted snapshot, but the guard still makes the function safe against a crash window where the snapshot was flushed before the HardState update reached fsync. Returning snapIndex is strictly safer than returning Commit = 50 (would incorrectly allow skip even though raft hasn't committed those entries). Correct as written.


setApplied advance-only + conf-change re-delivery — correct ✓

applyNormalCommitted has the entry.Index <= e.applied early-return guard, preventing re-delivery of [snap.Index+1, have] normal entries. Conf changes in the same range go through applyConfChangeCommitted without the guard — re-running ApplyConfChange / persistConfigState / applyConfigChange is idempotent (membership updates, Pebble config-state write). setApplied advance-only guard prevents the counter from regressing when the re-delivered conf-change entry calls setApplied(old_index). Design-correct.


TLA+ spec — no divergence ✓

kv/fsm.go anchor: changes in this commit are confined to coldStartSkipThreshold in wal_store.go, RestoreExecuted in monitoring/cold_start.go, and test files. No HLC issuance path, no OCC commit-ts logic, no MVCC visibility, no route-catalog wiring touched. HLC-4 preconditions (i)(ii)(iii) unaffected. No spec update required.


Round summary

Round Finding Severity Status
R1-F1 Header-error-before-CRC order doc Minor Open (doc-only)
R1-F2 have == want boundary test Minor Superseded by TestColdStartSkipThreshold + committed-index tests
R3-F1 Observer docstrings stale (snapIndex → WAL tail) Minor Closed — args reverted to snapIndex
R3-F2 Skip path with non-empty WAL untested Minor ClosedTestColdStartSkipThreshold + restructured gate covers it ✓
R5-F1 recordingObs.RestoreExecuted underflow, hides correct abs-gap assertion Minor Open
R5-F2 gap_ahead log uses committed-tail baseline, metric uses snapshot baseline Minor Open
Codex P2 #1 Gate on hardState.Commit not WAL tail ✓ Correct
Codex P2 #2 RestoreExecuted underflow in gauge ✓ Correct
TLA+ spec No divergence ✓ None

The two remaining open items are minor quality issues; neither affects production correctness or the soak-plan metrics.

… (claude #934 round 5)

Two findings on the 1b166a5 head:

1. recordingObs.RestoreExecuted (test double) used snapIndex - have,
   which underflows when have > snapIndex - the exact case the
   round-3 production fix handled. Tests passed only because both
   sides wrapped to the same value. Update the test double to
   mirror monitoring.ColdStartObserver's |snapIndex - have| and
   tighten TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries
   from a length-only assertion to the absolute gap (50).

2. The 'gap_ahead' / 'gap_behind' log fields used target (committed
   tail) as the baseline while the Prometheus gauges use snapIndex.
   Operators correlating the structured log against the metric saw
   different magnitudes whenever hardState.Commit > snap.Index
   (the post-replay normal case). Rename the log fields to be
   explicit:
   - last_committed_index (was last_wal_index; the helper now
     returns max(snap.Index, hardState.Commit), so the new name is
     more accurate)
   - gap_ahead_snapshot (skipped path; mirrors metric)
   - gap_ahead_committed (skipped path; new committed-tail relative)
   - gap_to_snapshot (executed path; absolute value, no underflow)
   - gap_behind_committed (executed path; matches gap_behind name
     semantics)

   Extracted the executed arm into reportColdStartExecute so
   reportColdStart stays under the cyclop limit (the abs-value
   computation pushes the inline branch over).

Caller audit: reportColdStartExecute is private to wal_store.go,
single call site from reportColdStart. recordingObs.RestoreExecuted
is the test double; assertion update was straightforward (1 line).

Tests + lint clean on ./internal/raftengine/etcd/ + ./monitoring/.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 7, 2026

@claude review

Addresses both round-5 findings:

  1. Finding 1 — recordingObs test double drift. recordingObs.RestoreExecuted updated to mirror monitoring.ColdStartObserver's absolute-value computation (|snapIndex - have|). Tightened TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries to assert the semantically correct gap (50) instead of a length-only check.

  2. Finding 2 — log/metric baseline drift. Renamed log fields to disambiguate the two baselines:

    • Skipped: gap_ahead_snapshot (= metric baseline) + gap_ahead_committed (= committed-tail relative).
    • Executed: gap_to_snapshot (absolute, no underflow) + gap_behind_committed.
    • last_wal_indexlast_committed_index since the helper now returns max(snap.Index, hardState.Commit).

    Extracted the executed arm into reportColdStartExecute to keep reportColdStart under cyclop.

Finding 3 (defensive Commit < snapIndex guard) — confirmed correct as-is by claude; no change.

go test + golangci-lint clean on ./internal/raftengine/etcd/ + ./monitoring/.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (d9bf745):

  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 7, 2026

Claude finished @bootjp's task in 1m 55s —— View job


PR #934 Round-6 Review — recordingObs fix + log/metric baseline disambiguation (commit d9bf745)

Todo list:

  • Load tools
  • Read wal_store.go — renamed log fields + reportColdStartExecute extraction
  • Read wal_store_skip_gate_test.gorecordingObs.RestoreExecuted abs-value fix + tightened gap assertion
  • Read monitoring/cold_start.go + monitoring/cold_start_test.go
  • Verify Finding 1 (abs-value test double) is correctly fixed
  • Verify Finding 2 (log/metric baseline disambiguation) is correctly done
  • Check cyclop compliance of extracted reportColdStartExecute
  • TLA+ spec: kv/fsm.go anchor — no new divergence
  • Post final review

Overall assessment

Both round-5 findings are correctly and completely closed. No new issues introduced. The implementation is clean.


R5-F1 — recordingObs.RestoreExecuted drift — Closed correctly ✓

The test double now mirrors monitoring.ColdStartObserver exactly:

// wal_store_skip_gate_test.go:91-106
func (o *recordingObs) RestoreExecuted(snapIndex, have uint64) {
    var gap uint64
    if have >= snapIndex {
        gap = have - snapIndex
    } else {
        gap = snapIndex - have
    }
    o.executed = append(o.executed, gap)
}

The tightened assertion in TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries (appliedIdx=150, snapIndex=100):

require.Equal(t, []uint64{appliedIdx - snapIndex}, obs.executed,
    "observer MUST record the absolute snapshot-relative gap")

Resolves to []uint64{50} — the semantically correct absolute gap, not []uint64{18446744073709551566} (underflow). The existing TestSkipGate_ExecutesWhenFSMStale (snapIndex=200, appliedIdx=100) continues to assert []uint64{snapIndex - appliedIdx} = []uint64{100}, which still equals the absolute gap since snapIndex > have in that case. ✓


R5-F2 — Log/metric baseline drift — Closed correctly ✓

The two-field disambiguation in reportColdStart (skip arm, wal_store.go:477-483):

zap.Uint64("gap_ahead_snapshot", have-snapIndex),   // metric baseline: have >= snapIndex guaranteed by gate
zap.Uint64("gap_ahead_committed", have-target),     // target = max(snap.Index, hardState.Commit)

And in the extracted reportColdStartExecute (wal_store.go:214-219):

zap.Uint64("gap_to_snapshot", gapToSnapshot),        // |snapIndex - have|, cannot underflow
zap.Uint64("gap_behind_committed", target-have),     // target > have guaranteed by gate condition
zap.Uint64("last_committed_index", target),

Both the comment ("// Two named gap fields so an operator correlating the log against the Prometheus gauge sees consistent magnitudes") and the field naming give operators an unambiguous correlation path to monitoring.ColdStartObserver. ✓

Underflow safety of log fields verified:

  • have - snapIndex (skip): skip gate requires have >= target >= snapIndex → no underflow. ✓
  • have - target (skip): skip gate requires have >= target → no underflow. ✓
  • target - have (execute): execute arm requires have < target → no underflow. ✓
  • gapToSnapshot (execute): explicit abs-value conditional → safe for all orderings. ✓

reportColdStartExecute extraction — Cyclop justified ✓

The extraction is the minimal change to keep reportColdStart under the cyclop limit. The abs-value conditional in the execute arm (if have >= snapIndex) adds one branch that would push the parent over. The split is clean: reportColdStartExecute has a single responsibility (log the execute arm) and no callers other than reportColdStart. ✓


TLA+ spec — No divergence ✓

kv/fsm.go anchor: this commit touches only wal_store.go (log-field renames + reportColdStartExecute extraction) and wal_store_skip_gate_test.go/monitoring/cold_start_test.go. No HLC issuance path, OCC commit-ts, MVCC visibility, or routing-catalog logic is touched. HLC-4 preconditions (i)(ii)(iii) unaffected. No spec update required.


One minor note (no action required)

TestSkipGate_SkipsWhenFSMFreshEnough and TestSkipGate_ReturnsEffectiveAppliedOnSkip both call restoreSnapshotState with lastWalIndex = snap.Metadata.Index (empty WAL tail). The skip path with non-empty WAL (have = lastWalIndex > snap.Metadata.Index) is still not tested directly — this was R3-F2 (open, nice-to-have) and remains so. Not a blocker; the gate logic is well-covered by TestColdStartSkipThreshold and TestSkipGate_ExecutesWhenWALCarriesPostSnapshotEntries, and the anti-double-apply mechanism (setApplied advance-only + applyNormalCommitted skip) is unit-tested independently.


Round summary

Finding Severity Status
R1-F1 — header-error-before-CRC ordering (doc-only) Minor Open (doc comment; no safety impact)
R3-F2 — skip path with non-empty WAL untested Minor Open (nice-to-have)
R5-F1recordingObs.RestoreExecuted underflow, masks correct abs-gap assertion Minor ✅ Closed (d9bf745)
R5-F2gap_ahead log uses committed-tail baseline, metric uses snapshot baseline Minor ✅ Closed (d9bf745)
TLA+ spec divergence n/a ✅ None

The implementation is correct and complete. All major and minor actionable findings across six review rounds are resolved. The two remaining open items are documentation-only (R1-F1) and test-coverage quality (R3-F2); neither affects production correctness, safety, or the soak-plan observability signals. Ready to merge.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9bf745f96

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/raftengine/etcd/engine.go Outdated
Comment on lines +2236 to +2237
if entry.Index <= e.applied {
return nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve post-snapshot HLC lease effects when skipping replay

When the committed tail contains an HLC lease entry after the snapshot followed by a KV mutation that advances metaAppliedIndex, the skip gate seeds e.applied at the tail and this guard drops both entries. ApplySnapshotHeader restores only the older snapshot-time ceiling, so the post-snapshot lease's in-memory SetPhysicalCeiling effect is never reconstructed. If that lease ceiling is still ahead of wall time, the restarted node can retain persisted timestamps above its HLC ceiling and reject fenced timestamp issuance or later issue timestamps from a stale ceiling; duplicate data mutations must be suppressed without suppressing volatile-only entries such as HLC leases.

Useful? React with 👍 / 👎.

Comment thread monitoring/cold_start.go
Comment on lines +39 to +42
appliedIndexGap: prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "elastickv_fsm_cold_start_applied_index_gap",
Help: "Absolute distance between the snapshot pointer and the FSM's reported applied index at cold start, in Raft log entries. outcome=skipped reports how far ahead the live store was; outcome=executed reports how far behind it was.",
}, []string{"outcome"}),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Partition cold-start metrics by Raft group

In multi-group mode, buildShardGroups opens several engines through one factory carrying the same registry observer, but this gauge is labeled only by outcome. Each shard therefore writes the same time series during startup, and the final value is merely whichever group's callback ran last rather than the gap for an identifiable FSM; the counters likewise aggregate multiple gate decisions despite the documentation describing at most one event per process. Include the Raft group in the observer or metric labels so multi-shard alerts and diagnostics remain meaningful.

Useful? React with 👍 / 👎.

… P1 #934 round 7)

When the cold-start skip gate fired the engine seeded `e.applied` at
the WAL committed tail and the duplicate guard
(`entry.Index <= e.applied { return nil }`) then dropped every replayed
entry — including HLC lease entries (raftEncodeHLCLease, tag 0x02).
HLC leases have purely in-memory effects (`HLC.SetPhysicalCeiling`),
so their post-snapshot ceiling raise was lost on every cold start.
`ApplySnapshotHeader` then restored only the older snapshot-time
ceiling and the next leader-issued fenced timestamp could collide
with persisted commit_ts values stamped under the lost lease.

Fix:
  - Add raftengine.VolatileEntryClassifier — an optional FSM seam
    that classifies cleartext FSM payloads as volatile-only or
    data-mutating.
  - kvFSM.IsVolatileOnlyPayload returns true iff payload[0] ==
    raftEncodeHLCLease. Static iface assertion in
    fsm_applied_index_iface_check.go.
  - applyNormalEntry takes a new dropIfNonVolatile flag. On the
    cold-start duplicate path the engine still decodes the envelope
    and decrypts (post-cutover HLC leases are encrypted) so the
    cleartext tag is visible, then either:
      * passes the payload to fsm.Apply (volatile — replayed for
        its in-memory monotonic effect, NO setApplied, NO
        resolveProposal); or
      * returns (nil, nil) (data-mutating — dropped, preserving the
        OCC / sidecar idempotency from PR #934 round 1).
  - applyNormalCommitted forwards `duplicate` to applyNormalEntry
    and only advances setApplied + resolveProposal on the fresh
    path.

Test:
  - cold_start_volatile_replay_test.go pins three contracts:
    (1) duplicate HLC lease → fsm.Apply called, e.applied unchanged;
    (2) duplicate KV/MVCC → fsm.Apply NOT called, e.applied unchanged;
    (3) FSM that does not implement the classifier → drop-all (no
        opt-in regression).
  - Also pins the fresh-entry asymmetry: every tag reaches fsm.Apply
    when entry.Index > e.applied.

Self-review:
  1. Data loss: HLC lease re-apply is monotonic
     (SetPhysicalCeiling) — no regression, no double-write. The
     guard for data-mutating entries is unchanged.
  2. Concurrency: classifier called inside the existing
     applyNormalEntry path on the same goroutine — no new locks.
  3. Performance: cold-start replay only; one extra envelope decode
     + (post-cutover) decryption per duplicate. Bounded by WAL tail
     length; not on a steady-state hot path.
  4. Consistency: classifier sees the cleartext payload (post-
     decrypt), so post-cutover HLC leases are still recognised.
     Encryption opcodes (0x03..0x07) correctly classified as data-
     mutating — re-apply would diverge the sidecar
     RaftAppliedIndex.
  5. Test coverage: three table-driven cases pinning every arm
     (volatile-replay, data-drop, classifier-absent), plus the
     fresh-entry asymmetry table.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (20f90a9):

  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 7, 2026

@claude review

Round 7 addresses codex P1: post-snapshot HLC lease effects were lost when the cold-start skip gate fired.

Root cause. The Round-1 idempotency guard
(if entry.Index <= e.applied { return nil }) treated every duplicate uniformly,
but HLC lease entries (raftEncodeHLCLease, tag 0x02) carry purely in-memory
effects (HLC.SetPhysicalCeiling). After the skip gate seeded e.applied at the
WAL committed tail, the guard dropped all post-snapshot lease entries.
ApplySnapshotHeader restored only the older snapshot-time ceiling, so the next
leader-issued fenced timestamp could collide with persisted commit_ts values
stamped under the lost lease.

Fix.

  • New optional FSM seam raftengine.VolatileEntryClassifier
    classifies cleartext FSM payloads as volatile-only or data-mutating.
  • kvFSM.IsVolatileOnlyPayload returns true iff payload[0] == raftEncodeHLCLease.
    Static iface assertion pins the contract.
  • applyNormalEntry now takes a dropIfNonVolatile flag. On the cold-start
    duplicate path the engine still decodes the envelope and decrypts
    (post-cutover HLC leases are encrypted), then either passes the payload to
    fsm.Apply (volatile — monotonic in-memory effect, no setApplied, no
    resolveProposal) or returns (nil, nil) (data-mutating — dropped, preserving
    the Round-1 OCC / sidecar idempotency).
  • applyNormalCommitted forwards duplicate to applyNormalEntry and only
    advances setApplied + resolveProposal on the fresh path.

Test. cold_start_volatile_replay_test.go pins three contracts:

  1. duplicate HLC lease → fsm.Apply called, e.applied unchanged
  2. duplicate KV/MVCC → fsm.Apply NOT called, e.applied unchanged (Round-1 regression locked down)
  3. FSM that doesn't implement the classifier → drop-all (no opt-in regression)

Plus the fresh-entry asymmetry: every tag still reaches fsm.Apply when
entry.Index > e.applied.

Caller audit. applyNormalEntry signature widened; every existing call site
(applyNormalCommitted + 10 encryption_test.go sites) passes false,
preserving the pre-PR semantics.

Open items not addressed in this commit.

  • codex P2 (factory plumbing for ColdStartObserver) — separate concern
    scoped at Factory.Create / OpenConfig; will follow up if you want it
    bundled or in a follow-up PR.
  • codex P2 (per-group metric labels) — same call; meaningful only after the
    factory plumbing lands so the registry observer is actually fed.

CI note: previous run hit a 504 on reviewdog install — transient infra, not a
lint failure on the code.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 7, 2026

Claude finished @bootjp's task in 3m 42s —— View job


PR #934 Round-7 Review — Volatile HLC lease replay on cold-start skip (commit 20f90a9)

Todo list:

  • Read trigger comment and classify request
  • Read internal/raftengine/etcd/engine.goapplyNormalEntry + applyNormalCommitted changes
  • Read internal/raftengine/statemachine.goVolatileEntryClassifier interface
  • Read kv/fsm.goIsVolatileOnlyPayload implementation
  • Read kv/fsm_applied_index_iface_check.go — compile-time guard
  • Read internal/raftengine/etcd/cold_start_volatile_replay_test.go — new test contracts
  • Review correctness of volatile-replay fix (HLC-4 / OCC safety)
  • Review dropIfNonVolatile path: encryption handling, setApplied / resolveProposal suppression
  • Review test coverage completeness (3 contracts + fresh-entry asymmetry)
  • TLA+ spec divergence check (kv/fsm.go anchor — SetPhysicalCeiling path)
  • Post final review

Overall assessment

The fix is correct. The root cause analysis is accurate, the seam is well-designed, and the three contracts tested are the right ones. One finding and one minor note below.


Finding 1 — SetApplyIndex is called with a stale index for volatile replays (Minor / informational)

Location: internal/raftengine/etcd/engine.go:2407

When applyNormalEntry takes the volatile-replay path (dropIfNonVolatile=true, IsVolatileOnlyPayload=true), it calls SetApplyIndex(entry.Index) before fsm.Apply(payload):

// In applyNormalEntry (line 2407)
if aware, ok := e.fsm.(raftengine.ApplyIndexAware); ok {
    aware.SetApplyIndex(entry.Index)   // ← entry.Index is BELOW e.applied on the duplicate path
}
return e.fsm.Apply(payload), nil

For a duplicate entry at index 150 with e.applied = 200, this sets kvFSM.pendingApplyIdx = 150. Since applyHLCLease doesn't read pendingApplyIdx, this has no observable effect on the HLC ceiling raise — the replay works correctly.

However, it is a semantic inconsistency: pendingApplyIdx is documented as "the Raft entry index the engine is about to apply" (kv/fsm.go:42) and it is consumed by applyEncryption (encryption sidecar RaftAppliedIndex) and ApplyMutationsRaftAt (sidecar sync). Since IsVolatileOnlyPayload must return false for encryption opcodes (enforced by the comment in kv/fsm.go:641-644), those consumers will never run on the volatile path. The risk is zero for the current implementation — but a future volatile entry type that also happens to use pendingApplyIdx for a durability side-effect would silently write a stale index.

The safest fix is to skip SetApplyIndex on the duplicate path entirely. Since volatile Apply (HLC lease) never reads pendingApplyIdx, this produces no behaviour change:

// Only SetApplyIndex on the non-duplicate path
if !dropIfNonVolatile {
    if aware, ok := e.fsm.(raftengine.ApplyIndexAware); ok {
        aware.SetApplyIndex(entry.Index)
    }
}
return e.fsm.Apply(payload), nil

Not a correctness bug today. Worth documenting at minimum.

Consider fixing →


Finding 2 — No test for the encrypted-volatile path (Minor)

Location: internal/raftengine/etcd/cold_start_volatile_replay_test.go

All three tests use newTestEngine(fsm, nil, nil) — no cipher, no cutover, so the decryption branch in applyNormalEntry is never exercised. The production scenario that matters most is:

  • e.raftCutoverIndex() is active (post-Stage-8a cluster)
  • A duplicate HLC lease at index 150 is encrypted (wrapped by WrapRaftPayload)
  • The engine must: decrypt → classify as volatile → replay

The correct ordering is locked in the code (decryption runs before IsVolatileOnlyPayload), but is not tested. A test using envelopeEntry from encryption_test.go with a volatileTagFakeFSM and a real cipher/cutover would close this gap:

// pseudo-sketch
c, kid := raftCipherFixture(t)
cutover := func() uint64 { return 100 }
fsm := &volatileTagFakeFSM{}
e := newTestEngine(fsm, c, cutover)
e.applied = 200

plainPayload := []byte{0x02, ...}   // HLC lease tag
entry := envelopeEntry(t, c, kid, 150, plainPayload)  // encrypted at index 150 (below e.applied=200)

err := e.applyNormalCommitted(entry)
// verify fsm.calls = 1, e.applied = 200

Not a blocker, but the current tests only cover the non-encrypted (pre-cutover) path.


Correctness of the fix — verified ✓

Property Verification
HLC ceiling monotonicity: replay is idempotent applyHLCLeaseSetPhysicalCeiling is a monotonic raise; replaying raises ceiling to at least the post-snapshot level ✓
OCC safety: data-mutating duplicates not replayed IsVolatileOnlyPayload = falseapplyNormalEntry returns (nil, nil) before SetApplyIndex/Apply
setApplied not advanced for any duplicate applyNormalCommitted:2260 — both volatile and non-volatile duplicates hit if duplicate { return nil } before setApplied
resolveProposal not called for volatile replay Same arm; no coordinator waiting for HLC lease proposals ✓
HaltApply guard runs before duplicate arm applyNormalCommitted:2252-2259 — checked on both fresh and duplicate path; HLC lease Apply returns nil so no halt risk ✓
FSMs without classifier get drop-all (no regression) !ok → return nil, nil in the type-assertion guard; TestApplyNormalCommitted_DuplicateWithoutClassifier_DropsAll pins it ✓
Post-cutover (encrypted) HLC leases: decryption before classification applyNormalEntry decrypts at line 2370-2391, then classifies at line 2401-2406 ✓
Compile-time guard against interface drift var _ raftengine.VolatileEntryClassifier = (*kvFSM)(nil) in kv/fsm_applied_index_iface_check.go

TLA+ spec — no divergence ✓

kv/fsm.go anchor: IsVolatileOnlyPayload is a pure classification (no state change — explicitly documented). applyHLCLease / SetPhysicalCeiling path is unchanged from previous rounds. The fix ensures the ceiling is raised to at least the post-snapshot level, which is the HLC-4 precondition (i) requirement (bounded skew: newly elected leader's ceiling ≥ the largest ceiling stamped on any committed entry). HLC-4 preconditions (ii) and (iii) are unaffected. No OCC commit-ts, MVCC visibility, or routing-catalog logic is touched. No spec update required.


Round summary

Round Finding Severity Status
R1-F1 Header-error-before-CRC ordering (doc-only) Minor Open (doc comment)
R4–R6 All major findings (apply-floor advance, observer args, emit-after-success, abs-gap, log baseline) Major/Minor ✅ Closed
R7-F1 SetApplyIndex called with stale index on volatile replay Minor / informational Open (no current risk)
R7-F2 Encrypted-volatile path not tested Minor Open (nice-to-have)
TLA+ spec divergence No divergence ✅ None
Codex P1 #934 round 7 Post-snapshot HLC lease ceiling lost on skip ✅ Correct

The codex P1 root cause is correctly diagnosed and the fix is complete. Ready to merge; the two open items are informational and don't affect production correctness or the soak plan.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

…pted-volatile path (claude #934 round 7 F1+F2)

R7-F1 (Minor / informational): on the cold-start duplicate-replay
path entry.Index is BELOW e.applied, so calling SetApplyIndex would
feed a stale index to any future durability sink (encryption sidecar
RaftAppliedIndex, ApplyMutationsRaftAt's metaAppliedIndex bundle).
Today's only volatile entry (HLC lease) does not read pendingApplyIdx
so this was a no-op, but it future-proofs the seam against a new
volatile entry type that also persists. SetApplyIndex now only fires
on the !dropIfNonVolatile arm.

R7-F2 (Minor / nice-to-have): the round-7 test fixtures all used
newTestEngine(fsm, nil, nil) — no cipher, no cutover — so the
decryption-before-classification ordering in applyNormalEntry was
locked in code but not in tests. Added
TestApplyNormalCommitted_VolatileDuplicate_PostCutoverEncrypted which
wraps an HLC lease via WrapRaftPayload at an above-cutover index
that's still BELOW e.applied. Verifies the engine decrypts, classifies
as volatile, replays for the in-memory effect, and hands fsm.Apply the
cleartext payload byte-for-byte.

While extracting cyclop budget, factored the dropIfNonVolatile classifier
arm into Engine.isVolatilePayload. Same semantics; cleaner call site.

Self-review:
  1. Data loss: no — replay path unchanged, SetApplyIndex
     suppression has no observable effect for HLC lease (the only
     current volatile type) but prevents future regression.
  2. Concurrency: no new locks, no new goroutines. Same serialised
     apply loop.
  3. Performance: one extra method call per cold-start duplicate.
     Bounded by WAL tail length; not on a steady-state path.
  4. Consistency: encrypted-volatile path now pinned by test —
     post-cutover HLC lease decrypt → classify → replay is locked
     against future drift.
  5. Test coverage: added explicit encrypted-volatile test;
     existing 3 contracts (volatile-replay / data-drop /
     classifier-absent) and fresh-entry asymmetry still pass.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (fc55c7b):

  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 7, 2026

@claude review

Round 7 F1+F2 follow-up — both Minor findings from your last review are now closed.

R7-F1 — SetApplyIndex stale-index on volatile replay. Moved
SetApplyIndex inside an if !dropIfNonVolatile {…} guard so it only
fires on the fresh-entry arm. Today's only volatile entry (HLC lease)
doesn't read pendingApplyIdx so this is observably a no-op, but it
future-proofs the seam against a new volatile entry type that would
also persist (the seam's documented purpose).

R7-F2 — Encrypted-volatile path untested. Added
TestApplyNormalCommitted_VolatileDuplicate_PostCutoverEncrypted
which uses raftCipherFixture + envelopeEntry to wrap an HLC lease
above the cutover index but BELOW e.applied. Verifies:

  1. fsm.Apply reached once (decrypt → classify → replay ordering),
  2. FSM receives the cleartext payload byte-for-byte (classifier must
    see post-decrypt bytes),
  3. e.applied pinned at 200 (no advance on duplicate arm).

Refactor note. Pulled the classifier check into
Engine.isVolatilePayload to stay under the cyclop budget — same
semantics, cleaner call site.

Tests + lint clean locally:

  • go test -race ./internal/raftengine/etcd/ ./kv/... ok
  • golangci-lint run ./internal/raftengine/... ./kv/... 0 issues

R1-F1 (header-error-before-CRC ordering doc comment) is the only
remaining open item; you flagged it as doc-only, no safety impact.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 7, 2026

Claude finished @bootjp's task in 2m 35s —— View job


PR #934 Round-8 Review — R7-F1 + R7-F2 follow-up (commit fc55c7b)

Todo list:

  • Load tools
  • Read engine.goSetApplyIndex guard + isVolatilePayload refactor
  • Read cold_start_volatile_replay_test.go — encrypted-volatile test
  • Verify R7-F1: SetApplyIndex guard correctness
  • Verify R7-F2: encrypted-volatile test coverage
  • Review isVolatilePayload extraction
  • TLA+ spec: kv/fsm.go anchor check
  • Post final review

Overall assessment

Both R7-F1 and R7-F2 are correctly and completely closed. No new issues. The PR is clean.


R7-F1 — SetApplyIndex stale-index on volatile replay — Closed correctly ✓

The guard is correctly placed:

// engine.go:2405-2409
if !dropIfNonVolatile {
    if aware, ok := e.fsm.(raftengine.ApplyIndexAware); ok {
        aware.SetApplyIndex(entry.Index)
    }
}
return e.fsm.Apply(payload), nil

The three code paths through applyNormalEntry after the change:

dropIfNonVolatile IsVolatileOnlyPayload SetApplyIndex called? fsm.Apply called?
false (fresh entry) any ✓ yes, at entry.Index ✓ yes
true (duplicate), data-mutating false ✗ returns (nil,nil) early at line 2392
true (duplicate), volatile true ✗ guarded by !dropIfNonVolatile ✓ yes, with cleartext payload

All three paths are correct. The comment explains the future-proofing rationale clearly. The change is a pure no-op for all currently existing volatile entries (only HLC lease, which reads no pendingApplyIdx) but guards the seam against a future volatile type that also persists.


R7-F2 — Encrypted-volatile path untested — Closed correctly ✓

TestApplyNormalCommitted_VolatileDuplicate_PostCutoverEncrypted correctly exercises the full decrypt → classify → replay ordering:

cipher wired, cutover=100, e.applied=200
plain := []byte{0x02, 0, 0, 0, 0, 0, 0, 0, 0x01}
entry := envelopeEntry(t, c, kid, 150, plain)

Execution trace through applyNormalEntry:

  1. decodeProposalEnvelope → extracts the wrapped (encrypted) payload
  2. entry.Index(150) > cutover(100)unwrapRaftPayloadpayload = plain = [0x02, ...]
  3. dropIfNonVolatile(true) && !isVolatilePayload([0x02,...])isVolatilePayload = true → does NOT return early
  4. !dropIfNonVolatile = falseSetApplyIndex suppressed
  5. fsm.Apply([0x02,...]) with cleartext

Three assertions:

  • fsm.calls = 1 — decryption ran before classification, lease was replayed ✓
  • fsm.lastPayload == plain — classifier received post-decrypt bytes ✓
  • e.applied == 200 — not advanced on duplicate arm ✓

This is precisely the scenario R7-F2 flagged. The test is correctly structured and locks in the ordering that protects against future regressions.


isVolatilePayload extraction — Clean ✓

The extraction is a pure refactor:

  • Before: inline type assertion + ok check in applyNormalEntry (would push past cyclop budget with the additional if !dropIfNonVolatile branch)
  • After: extracted to Engine.isVolatilePayload, identical semantics

The method comment accurately documents the "runs AFTER decryption" contract — critical for anyone reading the method in isolation. The !ok → false default is correctly stated as "preserving pre-PR semantics for FSMs that haven't opted in." ✓


TLA+ spec — No divergence ✓

This commit touches only engine.go and cold_start_volatile_replay_test.go. The kv/fsm.go anchor file (which triggers the TLA+ check on every push) is unchanged from the previous commits. As confirmed in Round-7:

  • applyHLCLease / SetPhysicalCeiling path: unchanged, HLC-4 preconditions (i)/(ii)/(iii) unaffected
  • ApplySnapshotHeader: unchanged
  • No OCC commit-ts, MVCC visibility, or routing-catalog changes

No spec update required.


Complete finding tracker (all rounds)

Round Finding Severity Status
R1-F1 Header-error-before-CRC ordering (doc comment) Minor Open (doc-only, no safety impact)
R1-F2 → R3-F2 Skip path coverage completeness Minor ✅ Superseded by committed-index gate restructure
R3-F1 → R6 Observer docstrings / log field alignment Minor ✅ Closed
R4 Apply-floor advance on skip (Codex P1) Critical ✅ Closed
R4 Observer args (Codex P2 / CodeRabbit Major) Major ✅ Closed
R4 Emit-after-success (CodeRabbit Major) Major ✅ Closed
R5 Gate on hardState.Commit not WAL tail Major ✅ Closed
R5 RestoreExecuted underflow in gauge Major ✅ Closed
R6 Test double abs-gap drift Minor ✅ Closed
R7 Post-snapshot HLC lease ceiling lost on skip (Codex P1) Critical ✅ Closed
R7-F1 SetApplyIndex with stale index on volatile replay Minor Closed (fc55c7b)
R7-F2 Encrypted-volatile path untested Minor Closed (fc55c7b)
TLA+ spec No divergence across all rounds ✅ None

The sole remaining open item is R1-F1 — a doc comment explaining the header-parse-before-body-CRC ordering in applyHeaderStateOnSkip. You confirmed this as doc-only with no safety impact. If you'd like to add it before merging, the location is wal_store.go at the applyHeaderStateOnSkip function body. Otherwise the PR is ready to merge.

@bootjp bootjp enabled auto-merge June 7, 2026 10:18
…tateOnSkip (claude #934 R1-F1)

R1-F1 was the only open finding across 8 review rounds, marked doc-
only with no safety impact. Add a comment in applyHeaderStateOnSkip
spelling out why header parse runs before the body-CRC compare:

  - Both header-parse errors and CRC errors are equally fatal for the
    skip path (snapshot file corruption) and both must propagate
    without ever calling ApplySnapshotHeader.
  - CRC compare stays AFTER the header parse so the TeeReader has
    actually been drained when h.Sum32() is read; inverting the order
    would let a CRC mismatch surface on a truncated body even when
    the header was valid, muddying the operator-facing diagnostic.
  - Matches the same ordering openAndRestoreFSMSnapshot uses, so
    fallback paths see consistent error chains regardless of which
    routine ran.

No code change; this is the only safe artifact of an audit comment.
Tests pass; lint clean.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 7, 2026

@claude review

Final cleanup: R1-F1 doc comment in applyHeaderStateOnSkip explaining the header-error-before-CRC ordering. No code change. Every finding across 8 rounds is now closed.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 7, 2026

Claude finished @bootjp's task in 1m 34s —— View job


PR #934 Round-9 Review — R1-F1 doc comment (final cleanup, commit 58db40c)

Todo list:

  • Load tools
  • Read wal_store.go — locate applyHeaderStateOnSkip and verify the doc comment
  • Verify the comment accurately explains the header-error-before-CRC ordering
  • Confirm TLA+ spec still clean
  • Post final review

R1-F1 — Header-error-before-CRC ordering — Closed correctly ✓

The inline comment at wal_store.go:539-551 is technically accurate and well-reasoned:

// Error-ordering contract (claude #934 R1-F1): header parse
// errors surface BEFORE the body-CRC compare runs, so callers
// (the skip-gate fallback in restoreSnapshotState) may observe
// either an ErrSnapshotHeaderUnknownMagic / InvalidLength chain or
// an ErrFSMSnapshotFileCRC chain depending on which check fails
// first. This is the same ordering openAndRestoreFSMSnapshot has
// — both errors are equally fatal for the skip path (they signal
// snapshot file corruption) and both must propagate without ever
// calling ApplySnapshotHeader. The CRC check stays AFTER the
// header parse so the TeeReader has actually been drained before
// we read h.Sum32(); inverting the order would let a CRC mismatch
// surface on a truncated body even when the header was valid,
// muddying the operator-facing diagnostic.

Three claims verified against the code:

Claim Code Verified
Header parse errors surface before body-CRC readSnapshotHeaderOrDrain at line 560 returns before h.Sum32() at line 565
TeeReader must be drained before h.Sum32() is meaningful tee := io.TeeReader(io.LimitReader(file, payloadSize), h) then drained through ParseSnapshotHeader/io.Copy before CRC read
Same ordering as openAndRestoreFSMSnapshot restoreAndComputeCRC in the full-restore path also runs header parse inside the TeeReader loop before the final CRC verify

The function-level docstring (lines 503-517) is also complete and accurate:

  • "CRC verification still runs [for FSMs without SnapshotHeaderApplier]" — confirmed: readSnapshotHeaderOrDrain with hasSetter=false calls io.Copy(io.Discard, tee) to drain all payload bytes through h, ensuring every byte is hashed regardless of whether the FSM exposes the seam. ✓
  • "On any verification failure the typed error propagates and FSM state stays untouched." — ApplySnapshotHeader is gated on hasSetter AND only called after both CRC checks pass. ✓

One very minor note (no action required): The attribution (claude #934 R1-F1) in the inline comment technically violates CLAUDE.md's convention ("Don't reference the current task, fix, or callers … since those belong in the PR description and rot as the codebase evolves"). The technical explanation stands without the attribution tag. It's a negligible style point and doesn't affect correctness; mentioning it only for completeness.


TLA+ spec — No divergence ✓

This commit is documentation-only — no code changed. All HLC-4 preconditions, OCC commit-ts ordering, MVCC visibility invariants, and routing-catalog contracts are unaffected. No spec update required.


Complete finding tracker (all rounds)

Round Finding Severity Status
R1-F1 Header-error-before-CRC ordering (doc comment) Minor Closed (58db40c)
R1-F2 → R3-F2 Skip path coverage completeness Minor ✅ Superseded by committed-index gate restructure
R4 Apply-floor advance on skip (Codex P1) Critical ✅ Closed
R4 Observer args (Codex P2 / CodeRabbit Major) Major ✅ Closed
R4 Emit-after-success (CodeRabbit Major) Major ✅ Closed
R5 Gate on hardState.Commit not WAL tail Major ✅ Closed
R5 RestoreExecuted underflow in gauge Major ✅ Closed
R6 Test double abs-gap drift Minor ✅ Closed
R6 Log/metric baseline drift Minor ✅ Closed
R7 Post-snapshot HLC lease ceiling lost on skip (Codex P1) Critical ✅ Closed
R7-F1 SetApplyIndex with stale index on volatile replay Minor ✅ Closed
R7-F2 Encrypted-volatile path untested Minor ✅ Closed
TLA+ spec No divergence across all rounds ✅ None

All findings across 9 rounds are closed. The implementation is correct, the soak-plan metrics are wired end-to-end, the safety invariants are preserved, and the documentation now fully explains the non-obvious ordering decisions. Ready to merge.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (58db40c):

  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 58db40cd6e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

voterCount: len(prepared.disk.LocalSnap.Metadata.ConfState.Voters),
isLearnerNode: learnerSetFromConfState(prepared.disk.LocalSnap.Metadata.ConfState),
applied: maxAppliedIndex(prepared.disk.LocalSnap),
applied: coldStartApplied(prepared.disk),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid persisting ahead-of-index snapshots during config replay

When the skip gate seeds e.applied to T for an FSM already ahead of snapshot S, Raft still re-emits committed configuration entries between S and T because newRawNode does not set an applied floor. For a configuration entry at C, applyConfChangeCommitted calls persistConfigState(C, ...); since configIndex is still initialized from S, snapshotPayload(C) serializes the already-T FSM and persists it as a Raft snapshot whose metadata index is only C. Installing that snapshot elsewhere exposes state through T while Raft believes only C is applied, causing subsequent entries C+1..T to be replayed against ahead state. Duplicate configuration entries still need ApplyConfChange, but their config-snapshot persistence must be suppressed or otherwise kept aligned with the FSM state.

Useful? React with 👍 / 👎.

@bootjp bootjp merged commit ca0b4e9 into main Jun 7, 2026
13 checks passed
@bootjp bootjp deleted the impl/snapshot-skip-b3-skip-gate branch June 7, 2026 10:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant