
Watchdog v1 (compare mode) + release bundle versioning#14

Merged
stephenctw merged 22 commits into main from feature/watch-dog
Jun 22, 2026

@stephenctw stephenctw commented May 11, 2026


Summary

This PR delivers two related scopes as one release:

1. Watchdog v1 — compare mode

Off-chain watchdog that compares sequencer finalized SSZ (GET /finalized_state) against canonical Cartesi Machine inspect at the same L1 inclusion block.

  • Compare path: Lua runner + machine_cartesi (prod) / machine_cli (harness); advance checkpointing retained
  • SSZ parity: shared wallet_snapshot encoding between sequencer, scheduler, and CM
  • L1 partition parity: shared fixture vector (tests/fixtures/l1_partition_vector.json) for Lua + Rust
  • Divergence signal: structured watchdog_event JSON on stderr; exit codes 0 ok / 1 transient / 2 deterministic mismatch; webhook/alarm.lua removed
  • Terminal errors: no retry on state_mismatch / inclusion_block_regressed
  • E2E: watchdog_genesis_compare_test + non-genesis compare at end of deposit_transfer_withdrawal_test; CI builds lcurl.so via just watchdog-lua-deps
  • Docs: operator deployment, getting started, staging drills
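
The divergence signal above (exit codes plus a structured watchdog_event on stderr) could be consumed by an alerting wrapper roughly as follows. This is only a sketch: the exit-code meanings come from this PR, but the event layout (one JSON object per stderr line with a top-level "watchdog_event" key) and the field names in the usage example are assumptions, not the actual schema.

```python
import json

# Exit-code contract as stated in the PR description.
EXIT_MEANING = {0: "ok", 1: "transient", 2: "deterministic_mismatch"}


def parse_watchdog_events(stderr_text):
    """Collect stderr lines that parse as JSON objects carrying a
    'watchdog_event' key (assumed layout, see lead-in)."""
    events = []
    for line in stderr_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # ordinary log line
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "watchdog_event" in obj:
            events.append(obj)
    return events


def classify(exit_code, stderr_text):
    """Turn one watchdog run into an alerting decision."""
    status = EXIT_MEANING.get(exit_code, "unknown")
    return {
        "status": status,
        # Page a human on a mismatch, and also on any unexpected exit code.
        "page": status in ("deterministic_mismatch", "unknown"),
        # Only transient failures are worth re-running automatically.
        "retry": status == "transient",
        "events": parse_watchdog_events(stderr_text),
    }
```

For example, `classify(2, stderr)` pages and does not retry, which matches the "no retry on state_mismatch" rule listed above.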

2. Release packaging — aligned artifact versions

Release tag vX is the bundle version for sequencer binaries, CM image tarballs, and watchdog.

  • Single pin source: release/versions.env (Rust, xgenext2fs, cartesi-machine, lua-curl)
  • CI/release: .github/actions/load-release-versions; scripts/verify-release-versions.sh in CI
  • Manifest: scripts/generate-release-manifest.sh → release-manifest-vX.json on GitHub Release
  • Watchdog image: watchdog/Dockerfile + docker save tarballs per arch; /opt/watchdog/RELEASE.json + OCI labels
  • Vendored lua-curl: full sources under watchdog/third_party/lua-curl/ (pinned via UPSTREAM / versions.env)

CARTESI_MACHINE_VERSION in versions.env must match the emulator inside the watchdog image and the one used to build canonical-machine-image-* tarballs. See release/README.md.
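
The single-pin idea can be illustrated with a small checker. This is a hedged sketch, not a port of scripts/verify-release-versions.sh: it assumes versions.env holds plain KEY=VALUE lines, and apart from CARTESI_MACHINE_VERSION the key names and version strings used in examples are hypothetical.

```python
def parse_env_pins(text):
    """Parse KEY=VALUE lines, skipping blanks and '#' comments.
    Surrounding single or double quotes on values are dropped."""
    pins = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        pins[key.strip()] = value.strip().strip('"').strip("'")
    return pins


def check_pin(pins, key, observed):
    """Return None when the observed version equals the pin,
    otherwise a human-readable error string."""
    expected = pins.get(key)
    if expected is None:
        return f"{key} is not pinned"
    if expected != observed:
        return f"{key} mismatch: pinned {expected}, observed {observed}"
    return None
```

A release job could call `check_pin(pins, "CARTESI_MACHINE_VERSION", version_reported_by_image)` once for the watchdog image and once for the canonical machine image build, and fail if either call returns a message.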

Historical note

Early commits introduced GET /get_state; later commits pivoted to /finalized_state + SSZ snapshot API (current design).

Test plan

  • lua watchdog/tests/run.lua
  • just test-watchdog-divergence-drill (exit 2 + watchdog_event)
  • cargo run -p rollups-e2e --bin rollups-e2e -- watchdog_genesis_compare_test --exact --nocapture
  • just test-rollups-e2e (includes deposit_transfer_withdrawal_test non-genesis compare)
  • bash scripts/verify-release-versions.sh
  • Staging compare daemon on Sepolia (operator drill 3)

Follow-ups (out of scope)

  • Enderson sign-off on exit-code / watchdog_event alerting contract
  • Port harness off machine_cli once in-process cartesi binding matches CLI archive format

@stephenctw stephenctw self-assigned this May 11, 2026
@stephenctw stephenctw force-pushed the feature/watch-dog branch from e424146 to f1796d2 May 11, 2026 14:31
@stephenctw stephenctw force-pushed the feature/watch-dog branch 2 times, most recently from bb4bbc2 to e609033 May 21, 2026 14:43
@stephenctw stephenctw marked this pull request as draft May 21, 2026 15:03
@stephenctw stephenctw force-pushed the feature/watch-dog branch 3 times, most recently from 45c89a5 to 0847b8c May 22, 2026 14:33
@stephenctw stephenctw requested a review from GCdePaula May 24, 2026 11:30
@stephenctw stephenctw marked this pull request as ready for review May 24, 2026 11:30
@stephenctw stephenctw force-pushed the feature/watch-dog branch 6 times, most recently from 06fa4de to 73f7c19 June 8, 2026 12:40
@stephenctw stephenctw changed the title from "Implement watch dog" to "Implement watch dog + release packaging" Jun 8, 2026
@stephenctw stephenctw changed the title from "Implement watch dog + release packaging" to "Watchdog v1 (compare mode) + release bundle versioning" Jun 8, 2026
@stephenctw stephenctw force-pushed the feature/watch-dog branch 2 times, most recently from a799571 to 2b7bf5b June 12, 2026 09:29

@GCdePaula GCdePaula left a comment


🚀🚀🚀

Wire sequencer safe-only state export, canonical CM inspect, and Lua
compare-mode E2E against Anvil devnet. Adds staging drill docs and
harness fixes (sequencer-devnet resolution, optional faketime).
Replace GET /get_state with finalized snapshot routes for operator compare.
Unify wallet SSZ encoding across sequencer dumps, CM inspect, and watchdog;
share L1 partition test vector with Rust. Restructure Lua modules (lcurl
binding, machine_cartesi, sequencer_reader), add devnet-stack helper and
production-like operator runbooks.
… e2e

Replace webhook alarms with watchdog_event JSON and exit codes 0/1/2.
Run genesis and non-genesis compare in rollups-e2e; build lcurl.so in CI
before the harness.
stephenctw and others added 13 commits June 22, 2026 21:53
Track pinned Lua-cURLv3 under watchdog/third_party/lua-curl so CI and
watchdog-lua-deps can build lcurl.so without a network fetch.
Centralize toolchain pins in release/versions.env; load them in CI and
release workflows. Add release manifest generation, version verification,
watchdog Docker image build, and docker-save tarballs per arch.
The staging drill intentionally exits like production on mismatch; wrap
the just target so local smoke runs succeed while direct invocation still
returns 2 for operators.
…rness)

Harden the watchdog release and test path: Dockerfile runs under bash with full
cartesi-machine runtime deps, CI adds a docker smoke job and divergence drill,
e2e exercises production main.lua via machine_cartesi, runner retries when
finalized inclusion_block moves during compare, non-genesis tests assert real
snapshot content, and the divergence drill drives main.run_compare_cycle with
an exit-2-aware wrapper script.
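
One possible reading of "runner retries when finalized inclusion_block moves during compare", combined with the terminal errors named in the PR summary, is sketched below. The callables `read_finalized` and `inspect_machine` are hypothetical stand-ins for the sequencer reader and the CM inspect call; the real Lua runner may order these steps differently.

```python
class TerminalError(Exception):
    """Deterministic failure: must not be retried (would map to exit code 2)."""


def run_compare(read_finalized, inspect_machine, max_attempts=3, last_block=None):
    """read_finalized() -> (inclusion_block, ssz_bytes)
    inspect_machine(block) -> ssz_bytes

    Returns the compared block on success, or None when every attempt
    was invalidated by a moving target (a transient outcome)."""
    for _ in range(max_attempts):
        block, seq_state = read_finalized()
        if last_block is not None and block < last_block:
            raise TerminalError("inclusion_block_regressed")
        cm_state = inspect_machine(block)
        block_after, _ = read_finalized()
        if block_after != block:
            continue  # finalized state moved mid-compare; start over
        if cm_state != seq_state:
            raise TerminalError("state_mismatch")
        return block
    return None
```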
Clear stale checkpoint snapshots on retry, reject multi-chunk CM inspect, and
retry when L1 RPC latest head lags the sequencer target block. Pin lua-curl
tarball sha256, remove unused watchdog config and dead storage SQL, fix operator
docs and release prerelease handling, and document checkpoint retention.
Add a real non-genesis divergence e2e (devnet sequencer vs sepolia CM image),
drop machine_cli in favor of machine_cartesi-only compare paths, and give
watchdog its own justfile with a doctor recipe for local toolchain checks.
machine_cartesi must instantiate via cartesi.new():load(), not the
cartesi.machine userdata factory. Doctor now probes a real CM load, and
rollups-e2e ensures the sepolia machine image exists for the divergence
scenario.
Document doctor, divergence e2e, and main.lua success semantics; harden checkpoint rm, docker entrypoint, and docker-smoke arch detection.
…duction

Reshapes the watchdog to a single minimal compare path and closes the issues
found across review.

Simplify:
- One job, one shot. Removed the compare/advance mode split and the daemon loop.
  `init` records the canonical bootstrap state once; `tick` runs exactly one
  compare cycle and exits 0 (clean/idle) / 1 (transient) / 2 (divergence). Infra
  schedules re-runs and enforces non-overlap via flock; no in-process loop/lock.
- Removed the verified-bit / advance-checkpoint provenance machinery: a persisted
  checkpoint is verified by construction (only a successful compare writes one),
  so the cheap-skip needs no extra state.
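
The "one shot, scheduled externally, non-overlap via flock" model can be sketched from the scheduler's side. This is an illustration under assumptions: the lock path and the command to run are supplied by the caller, and the PR itself does not say which tool takes the lock in production. It uses an advisory `flock` lock, so it is Unix-only.

```python
import fcntl
import os
import subprocess
import sys


def run_exclusive(lock_path, argv):
    """Run argv only if nobody else holds lock_path.

    Returns the child's exit code, or None when another run still holds
    the lock and this invocation was skipped."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # Non-blocking exclusive lock: fail fast instead of queueing ticks.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        return subprocess.run(argv).returncode
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

Because the lock lives outside the watchdog process, a crashed tick cannot leave a stale in-process lock behind; the kernel drops the lock when the descriptor closes.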

Harden / correctness:
- Crash-safe keep-1 checkpointing: atomic head.json pointer flip + predecessor
  GC; only ever writes a fresh checkpoint dir (no destructive in-place rewrite).
- Bootstrap is verified at its own block before being trusted; exit 0 means
  verified-or-idle on every path (fails closed to 1 or 2, never a false OK).
- Stream bisected eth_getLogs into the CM (no whole-range materialization),
  order-equivalent to the global sort.
- L1 RPC URL supplied per tick, never persisted (no provider secret at rest).
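
The streamed, bisected eth_getLogs idea can be illustrated as a generator. This is a sketch, not the Lua implementation: `fetch` is a hypothetical callable wrapping the RPC, `RangeTooLarge` stands in for whatever error a provider returns on an oversized range, and log fields are assumed to be already decoded to integers.

```python
class RangeTooLarge(Exception):
    """Hypothetical signal that the provider rejected the block range."""


def stream_logs(fetch, lo, hi):
    """Yield logs for blocks lo..hi (inclusive) in (blockNumber, logIndex)
    order, splitting the range in half whenever fetch() rejects it."""
    try:
        logs = fetch(lo, hi)
    except RangeTooLarge:
        if lo == hi:
            raise  # a single block cannot be split further
        logs = None
    if logs is None:
        mid = (lo + hi) // 2
        # Every block in the left half precedes every block in the right
        # half, so emitting left then right matches a global sort.
        yield from stream_logs(fetch, lo, mid)
        yield from stream_logs(fetch, mid + 1, hi)
    else:
        yield from sorted(logs, key=lambda l: (l["blockNumber"], l["logIndex"]))
```

Only one sub-range is held in memory at a time, which is the point of avoiding whole-range materialization.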

Build / CI:
- Vendored lua-cURLv3 in-tree for a hermetic build (no build-time download or pin
  to verify); Docker image builds and smoke-runs require('cartesi') in CI;
  graceful scheduler Inspect (no guest panic on an unknown query).
- e2e drives the production binding incl. a real-component divergence test;
  e2e prerequisites hard-fail instead of skipping vacuously.

Docs updated for init/tick, head.json / config.json / WATCHDOG_STATE_DIR, flock,
and the checkpoint-backup story.
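
The keep-1 checkpoint layout described in this commit (fresh directory, atomic head.json flip, then predecessor GC) could look roughly like the following. The directory naming, the `state.bin` file, and the head.json fields are assumptions made for illustration only.

```python
import json
import os
import shutil


def read_head(state_dir):
    """Return the parsed head.json, or None if no checkpoint exists yet."""
    try:
        with open(os.path.join(state_dir, "head.json")) as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def write_checkpoint(state_dir, block, payload):
    """Write a new checkpoint, flip head.json to it, then GC the old one."""
    name = f"checkpoint-{block}"
    new_dir = os.path.join(state_dir, name)
    os.mkdir(new_dir)  # raises if it exists: never rewrite in place
    with open(os.path.join(new_dir, "state.bin"), "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

    previous = read_head(state_dir)
    head_path = os.path.join(state_dir, "head.json")
    tmp_path = head_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"checkpoint": name, "block": block}, f)
        f.flush()
        os.fsync(f.fileno())
    # Atomic rename: readers see either the old pointer or the new one.
    os.replace(tmp_path, head_path)

    # The predecessor is removed only after the pointer has moved, so a
    # crash at any earlier step leaves the old checkpoint usable.
    if previous and previous["checkpoint"] != name:
        shutil.rmtree(os.path.join(state_dir, previous["checkpoint"]),
                      ignore_errors=True)
```

A crash between the rename and the cleanup leaves one orphaned directory behind, which is harmless because head.json already names the new checkpoint.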
@stephenctw stephenctw merged commit a80b25c into main Jun 22, 2026
8 checks passed
