Validator State Backup #2264

@sergerad

We need a way to back up Validator state (validated transactions and blocks). There are various options to consider.

I would weigh these considerations the most when assessing the options:

  1. Backup should be async to the hot path
  2. Recovery point objective (RPO) should be low (the window of data that can be lost because it has not yet been backed up)
  3. Code complexity should be low and the solution maintainable

Option 1: gRPC Stream API + Backup Process

The Validator gRPC API provides an endpoint for streaming historic and live data.

A new backup process connects to the stream and writes the data to a database and to disk.

+ Backup is stored in the format (SQL/file) that the Validator requires
+ Backup process is robust to failures (can retry/restart without impact)
+ Backup process is async to Validator's tx and block validation
- Most code complexity (gRPC endpoint, backup client)
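
A minimal sketch of the backup process side of Option 1, to show why it can retry or restart without affecting the Validator. Everything here is hypothetical: `FakeValidatorStream` stands in for the proposed gRPC streaming endpoint (which does not exist yet), and `BackupStore` stands in for the DB/disk target. The idea is only that the client resumes from the last block it has persisted.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Block:
    number: int
    payload: bytes


class FakeValidatorStream:
    """Hypothetical stand-in for the Validator's streaming endpoint."""

    def __init__(self, blocks: List[Block], fail_after: Optional[int] = None):
        self.blocks = blocks
        self.fail_after = fail_after  # simulate one dropped connection

    def subscribe(self, from_block: int) -> Iterator[Block]:
        sent = 0
        for block in self.blocks:
            if block.number < from_block:
                continue  # historic data the client already has
            if self.fail_after is not None and sent == self.fail_after:
                self.fail_after = None  # fail only once
                raise ConnectionError("stream dropped")
            sent += 1
            yield block


@dataclass
class BackupStore:
    """Hypothetical backup target, keyed by block number."""

    blocks: Dict[int, bytes] = field(default_factory=dict)

    def next_block(self) -> int:
        # Resume point: one past the highest block persisted so far.
        return max(self.blocks, default=-1) + 1

    def write(self, block: Block) -> None:
        self.blocks[block.number] = block.payload  # idempotent overwrite


def run_backup(stream: FakeValidatorStream, store: BackupStore, max_retries: int = 3) -> None:
    for _ in range(max_retries + 1):
        try:
            for block in stream.subscribe(store.next_block()):
                store.write(block)
            return
        except ConnectionError:
            continue  # reconnect and resume from the last persisted block
    raise RuntimeError("backup did not complete")
```

Because the resume point is derived from the backup store itself, a crashed or restarted backup process needs no coordination with the Validator beyond reconnecting.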

Option 2: Backup to S3

The Validator performs backup of transactions and blocks to S3 directly.

+ Minimal code complexity
- Backup is not stored in the format (SQL/file) that the Validator requires
- Backup is synchronous to Validator's tx and block validation
- Backup failure impacts Validator
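
To illustrate the last two drawbacks, here is a small sketch contrasting a backup call made on the validation path with a hand-off to a queue. The function names and the `BackupUnavailable` error are hypothetical, and the validation work itself is elided.

```python
import queue


class BackupUnavailable(Exception):
    """Hypothetical error raised when the backup target cannot be reached."""


def failing_backup(tx: bytes) -> None:
    raise BackupUnavailable("backup target unreachable")


def validate_sync(tx: bytes, backup) -> str:
    # ... validation work elided ...
    backup(tx)  # on the hot path: backup latency and failures propagate to the caller
    return "validated"


def validate_async(tx: bytes, pending: "queue.Queue[bytes]") -> str:
    # ... validation work elided ...
    pending.put(tx)  # hand off; a separate worker would drain this queue
    return "validated"
```

The asynchronous variant keeps backup failures away from validation, but anything still sitting in the queue when the process dies is lost, so it trades a small amount of RPO for isolation.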

Option 3: Spare Validator

The primary Validator forwards all requests it receives to a spare Validator.

+ Minimal code complexity
+ Backup is stored in the format (SQL/file) that the Validator requires
- Backup is synchronous to Validator's tx and block validation
- Backup failure impacts Validator

Option 4: AWS EBS Snapshots

Take periodic snapshots of the Validator's EBS volume.

+ Minimal code complexity
+ Backup is stored in the format (SQL/file) that the Validator requires
- Hourly RPO
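
As a rough worked example of what an hourly RPO means in practice: the worst-case loss is everything written since the last completed snapshot. The transaction rate below is a made-up figure for illustration only, and the 1-second interval is the approximate figure quoted for Option 5 in the comparison.

```python
def worst_case_loss(backup_interval_s: float, tx_per_s: float) -> float:
    """Upper bound on transactions lost if the volume dies just before the next backup."""
    return backup_interval_s * tx_per_s


ASSUMED_TX_PER_S = 10  # hypothetical load, not a measured figure

hourly = worst_case_loss(3600, ASSUMED_TX_PER_S)  # hourly snapshots
streamed = worst_case_loss(1, ASSUMED_TX_PER_S)   # roughly 1s replication lag
```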

Option 5: Litestream (or similar)

Run an out-of-band process that streams SQLite data to S3.

+ Minimal or no code complexity
+ Can be reused for the Validator, Sequencer, RPCs, etc.
+ Backup is stored in the format (SQL/file) that the Validator requires
+ Backup process is robust to failures (can retry/restart without impact)
+ Backup process is async to Validator's tx and block validation
- Only backs up SQLite data
- Adds an external operational dependency (a separate process to run/monitor)
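
For intuition about what "out-of-band" means here, the sketch below copies a live SQLite database from a separate connection using SQLite's online backup API (exposed in Python as `sqlite3.Connection.backup`). This is not how Litestream works (Litestream continuously replicates WAL changes); it is a simpler stand-in technique, and the function name and paths are hypothetical.

```python
import sqlite3


def snapshot_sqlite(src_path: str, dst_path: str) -> None:
    """Copy a SQLite database to dst_path without going through the writer's code path."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # Online backup: produces a consistent copy of the source database.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

A real deployment would still need to ship the resulting file to S3 and run this often enough to meet the RPO, which is the part a tool like Litestream takes care of.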

| Option | 1. Async (off hot path) | 2. Low RPO | 3. Low code complexity | Format-integrated | Robust to failure | Reusable across components |
| --- | --- | --- | --- | --- | --- | --- |
| 1. gRPC stream + backup process | ✅ | ✅ (~realtime) | ❌ (most code) | ✅ | ✅ | ❌ (validator-specific) |
| 2. Backup to S3 (direct) | ❌ (synchronous) | ✅ (sync → near-zero) | ✅ | ❌ (raw dump) | ❌ (failure blocks validator) | 🟡 (custom per component) |
| 3. Spare validator | ❌ (synchronous) | ✅ (sync → near-zero) | 🟡 (forwarding + spare lifecycle) | ✅ | ❌ (failure blocks validator) | ❌ (validator-specific) |
| 4. EBS snapshots | ✅ | ❌ (hourly) | ✅ | ✅ (whole volume) | ✅ | ✅ (volume-level) |
| 5. Litestream (or similar) | ✅ | ✅ (~1s) | ✅ | 🟡 (SQLite only, not file store) | ✅ | ✅ (validator, sequencer, RPC) |