Skip to content

fix(archiver): validate checkpoint attestations before fetching blobs (A-1252)#24183

Closed
PhilWindle wants to merge 12 commits into
merge-train/spartan-v5from
phil/a-1252-archiver-validate-attestations-before-blob-decode
Closed

fix(archiver): validate checkpoint attestations before fetching blobs (A-1252)#24183
PhilWindle wants to merge 12 commits into
merge-train/spartan-v5from
phil/a-1252-archiver-validate-attestations-before-blob-decode

Conversation

@PhilWindle

@PhilWindle PhilWindle commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Summary

During L1 sync the archiver fetched and decoded checkpoint blob data before validating committee attestations. A selected L1 proposer could publish a checkpoint with invalid/insufficient attestations plus malformed-but-hash-matching blob data; the blob decode threw BlobDeserializationError before the invalid-attestation skip path ran, so the checkpoint was never recorded as rejected and the L1 sync point never advanced past it. The archiver then re-queried the same L1 blocks every poll and re-threw forever — taking any valid checkpoints in the same batch down with it. This is a sync-liveness / DoS issue at the L1 ingestion boundary (no invalid state accepted, no asset loss).

Fix

Validate attestations from L1 calldata first — the signed consensus payload (header, archive root, fee asset price modifier) is fully available without blobs. Reject invalid and descendant-of-rejected checkpoints (emitting the same events and persisting the same rejected entries as before) without fetching their blobs, then fetch blobs in parallel only for the surviving checkpoints and ingest.

  • validation.ts: add validateCheckpointAttestationsFromCalldata and extract the shared core validator (validateCheckpointAttestations now delegates to it).
  • l1_synchronizer.ts: reorder handleCheckpoints to validate-then-fetch; lastSeenCheckpoint is tracked from calldata ({ checkpointNumber, l1 }) since rejected checkpoints are no longer built into PublishedCheckpoints.
  • archiver-sync.test.ts: regression test — a checkpoint with invalid attestations and malformed blob data is rejected without a BlobDeserializationError retry loop.

Verification

  • New regression test fails on the unfixed code (BlobDeserializationError) and passes with the fix.
  • Stuck-without-fix behavior verified empirically: with the fix reverted, repeated syncImmediate() throws every time, the sync point never advances (syncedL1=undefined), and no checkpoint is ingested or rejected.
  • Full archiver-sync.test.ts + validation.test.ts: 68/68 pass. tsc --noEmit clean.

Source: AztecProtocol/aztec-claude#1553


Follow-up: blob-failure liveness hardening (rows 4/5)

The reorder above fixes the case of a checkpoint with invalid attestations + a bad blob. A checkpoint with valid attestations (a ≥2/3 committee bribe) but a malformed or withheld blob still threw during blob decode — passing attestation validation, then aborting the whole syncFromL1 iteration. Because this.l1Timestamp is only advanced at the end of syncFromL1, the L1 sync clock froze, so every honest proposer's checkSync failed and nobody proposed — meaning the epoch-prune recovery never fired. A single bribed-committee + gossip-withholding attack could permanently halt the chain.

Fix (l1_synchronizer.ts, handleCheckpoints): gate the blob fetch/decode failure on rollup.canPruneAtTime.

  • Epoch still provable (cannot prune yet): keep treating the failure as fatal and retry — the blob is canonical and must eventually become available (transient beacon unavailability must not be skipped).
  • Proof window expired (rollup can prune on the next L1 block): the checkpoint is destined for pruning, so skip it and every later checkpoint this iteration and let the existing epoch-prune machinery run. Sync no longer throws, the clock advances, honest proposers resume, prune-on-propose fires on L1, and the bad checkpoint is dropped — automatic recovery.

Only NoBlobBodiesFoundError / BlobDeserializationError are caught; other errors propagate as before.

Tests

  • archiver-sync.test.ts: malformed blob + valid attestations while the epoch can still be proven → still throws; once prunable → does not throw, the checkpoint is skipped, the pending chain rolls back to the proven tip, and the sync clock advances; a matching local proposed checkpoint is still promoted (blob fetch skipped) even when its on-chain blob is malformed.
  • e2e_epochs/epochs_invalidate_block (invalid attestations): an invalid-attestations checkpoint is detected from calldata before any blob fetch and a fresh node syncs past it with its blob withheld from the shared store.
  • e2e_epochs/epochs_blob_unavailable_prune (valid attestations, rows 4/5): a gossip-disabled observer whose copy of a valid-attestations checkpoint's blob is withheld keeps its L1 sync clock frozen while the epoch is still provable, then unfreezes and advances past the checkpoint once its proof window expires (it becomes prunable) — instead of looping on the blob fetch forever.

… (A-1252)

During L1 sync the archiver fetched and decoded checkpoint blob data before
validating committee attestations. A selected proposer could publish a
checkpoint with invalid/insufficient attestations plus malformed-but-hash-
matching blob data; the blob decode threw BlobDeserializationError before the
invalid-attestation skip path ran, so the checkpoint was never recorded as
rejected and the L1 sync point never advanced past it. The archiver then
re-queried the same L1 blocks every poll and re-threw forever, taking any valid
checkpoints in the same batch down with it (sync-liveness / DoS).

Validate attestations from calldata first (the signed consensus payload --
header, archive root, fee asset price modifier -- is fully available without
blobs), reject invalid and descendant-of-rejected checkpoints without fetching
their blobs, then fetch blobs in parallel only for the survivors and ingest.

- validation.ts: add validateCheckpointAttestationsFromCalldata and extract the
  shared core validator.
- l1_synchronizer.ts: reorder handleCheckpoints to validate-then-fetch; track
  lastSeenCheckpoint from calldata since rejected checkpoints are no longer built.
- archiver-sync.test.ts: regression test covering invalid attestations + malformed
  blob, asserting rejection without a BlobDeserializationError retry loop.

Source: AztecProtocol/aztec-claude#1553
@PhilWindle PhilWindle added ci-draft Run CI on draft PRs. ci-full Run all master checks. labels Jun 18, 2026
… be pruned (A-1252)

A checkpoint with valid attestations but a malformed or withheld blob still threw during blob
decode in handleCheckpoints, aborting the whole sync iteration and freezing the L1 sync clock
(this.l1Timestamp is only advanced at the end of syncFromL1). Frozen clock => every honest
proposer's checkSync fails => nobody proposes => the epoch-prune recovery never fires, so a single
bribed-committee + gossip-withholding attack permanently halts the chain.

Gate the blob failure on rollup.canPruneAtTime: while the checkpoint's epoch can still be proven a
blob failure stays fatal (the blob is canonical and must become available, so we keep retrying);
once the proof window has expired the checkpoint is destined for pruning, so we skip it and every
later checkpoint this iteration and let the existing epoch-prune machinery recover. Sync no longer
throws, the clock advances, honest proposers resume, prune-on-propose fires on L1, and the bad
checkpoint is dropped.

Adds unit tests for the prunable/non-prunable boundary and confirms a matching local proposed
checkpoint is still promoted (blob fetch skipped) even when its on-chain blob is malformed.
…esh node syncs past withheld blob (A-1252)

Adds an e2e to epochs_invalidate_block: all sequencers post a checkpoint with insufficient
attestations, then revert to honest config so it is invalidated. Asserts an already-running honest
node detects the bad checkpoint from L1 calldata (InvalidAttestationsCheckpointDetected) before any
blob fetch, and that a fresh observer node started afterwards — with the bad checkpoint's blob
deleted from the shared file store and no gossiped copy to promote — still syncs past it rather than
looping on a blob-fetch error.
…zes sync clock once prunable (A-1252)

Adds the rows-4/5 counterpart to the invalid-attestations e2e: a checkpoint with valid attestations
but a withheld blob cannot be rejected by attestation validation, so a node must fetch its blob to
ingest it. Using a short proof window and no prover, the test freezes the chain on such a checkpoint,
deletes its blob from the shared store, and starts a gossip-disabled observer that never promotes.
The observer's L1 sync clock stays frozen (getSyncedL1Timestamp undefined) while the epoch is still
provable, then advances past the checkpoint once the proof window expires and it becomes prunable —
the clock-unfreeze the fix restores. Without the fix the observer would loop on the blob fetch forever.
…hable-blob checkpoint (A-1252)

Extends the rows-4/5 e2e: after the observer's sync clock unfreezes (it skipped the unfetchable
checkpoint once prunable), production is resumed. The next proposer prunes the doomed unproven epoch
on L1 and the chain rebuilds; every node — all validators plus the previously-stuck observer —
progresses past the bad checkpoint. Because the observer never ingested that checkpoint, it can only
get past it if the chain actually prunes (a descendant would otherwise be unservable to it), so this
also asserts the prune occurred. Freezing/resuming is done via minTxsPerBlock + buildCheckpointIfEmpty
toggles rather than stopping the sequencer (which closes its store and cannot be cleanly restarted).
…kpoint while canonical (A-1252)

Reworks the two blob-unavailability e2e tests to be deterministic and to exercise the intended paths:

- Production is driven by feeding txs (minTxsPerBlock 1, empty checkpoints disabled) rather than racing
  sequencer-config changes against a stream of empty checkpoints, so we control exactly which checkpoint
  is bad and the chain stays frozen on it once we stop feeding txs.
- The invalid-attestations test now keeps the bad checkpoint CANONICAL (invalidation disabled on the
  sequencers) so the late, no-promote observer rejects it via the L1 calldata attestation check — not the
  archive-mismatch filter, which only drops already-invalidated/replaced checkpoints. With its blob
  withheld, the observer's sync clock still advances past it, proving it never fetched the blob. The chain
  then invalidates and rebuilds, and every node (validators + observer) converges past it.
- The rows-4/5 test keeps the same shape: a valid-attestations checkpoint with a withheld blob freezes a
  no-promote observer's clock until its proof window expires, after which it skips the checkpoint and the
  clock advances; resuming production prunes the doomed epoch and all nodes converge.

Asserting on the sync clock (a persistent signal) rather than the transient detection event avoids a
flaky race where a late observer detects the checkpoint during its initial sync before a listener attaches.
…ipped prunable one is not ingested (A-1252)

Adds a unit test exercising the handleCheckpoints loop-break (not just the in-batch filter): with a tiny
batch size, a malformed-blob checkpoint (CP2, prunable) and a good-blob checkpoint that chains off it
(CP3) fall in separate L1-block batches. The fix must skip CP3 along with CP2; without the break,
addCheckpoints hits InitialCheckpointNumberNotSequentialError on the gap and re-freezes sync. Verified
red without the break. Widens the buildArchiver test helper to accept a batchSize override.
@PhilWindle PhilWindle marked this pull request as ready for review June 22, 2026 12:34

@spalladino spalladino left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but the one change I'd make is not skipping a checkpoint if its blob is unavailable. We can discuss during standup.

expect(checkpointedBlocks[0].checkpointNumber).toEqual(2);
}, 10_000);

it('promotes a matching local checkpoint even when its on-chain blob is malformed', async () => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence on whether this is a good idea. I agree we should promote if blob data is unreachable. But if it's incorrect, shouldn't we fail loudly? It means the on-chain data is corrupt.

No need to change on this PR though, since it's how it worked before.

Comment on lines +1202 to 1205
if (stopAfterBatch) {
break;
}
} while (searchEndBlock < currentL1BlockNumber);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (stopAfterBatch) {
break;
}
} while (searchEndBlock < currentL1BlockNumber);
} while (searchEndBlock < currentL1BlockNumber && !stopAfterBatch);

My OCD would appreciate it

Comment on lines +64 to +66
* Outcome of fetching and building a single checkpoint from its blobs. A blob that is missing or
* undecodable yields a `blobError` sentinel instead of rejecting the whole fetch pool, so the caller
* can decide in checkpoint order whether the failure is fatal. See A-1252 (rows 4/5).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR: should we be referencing non-public issues in the code? I'm not sure: on one hand it's useful for us to trace back why we did something, on the other it pollutes the comments for anyone without access to it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that shouldn't be referencing the ticket like that. I had cleaned up other occurences but missed that, thanks.

Comment on lines +1011 to +1014
// A blob fetch/decode failure is only fatal while the checkpoint's epoch can still be proven. Once the
// proof-submission window has expired (the rollup can prune on the next L1 block), the checkpoint is
// destined for pruning, so we stop treating it as fatal: we skip it (and every later checkpoint) and
// let the epoch-prune recovery proceed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disagree here. A blob decode failure is fatal, but not a fetch. If our consensus client is down for whatever reason, we can't just skip ahead. That'll break things down the road.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right.

Comment on lines +1022 to +1030
if (!(await this.canPrune(currentL1BlockNumber, currentL1Timestamp))) {
// The checkpoint is canonical and may still be proven, so the blob must eventually become
// available. Rethrow to retry on the next iteration.
this.log.error(
`Failed to fetch blob for checkpoint ${failedNumber} whose epoch can still be proven; will retry`,
{ checkpointNumber: failedNumber, l1BlockNumber: firstBlobFailure.checkpoint.l1.blockNumber },
);
throw firstBlobFailure.blobError;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check that the failed checkpoint is prunable? Otherwise this can trigger for historical sync.

Comment on lines +1196 to +1199
lastSeenCheckpoint = {
checkpointNumber: lastCalldataCheckpoint.checkpointNumber,
l1: lastCalldataCheckpoint.l1,
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lastSeenCheckpoint = {
checkpointNumber: lastCalldataCheckpoint.checkpointNumber,
l1: lastCalldataCheckpoint.l1,
};
lastSeenCheckpoint = lastCalldataCheckpoint;

Feels cleaner

Comment on lines +1051 to +1052
// Resume honest production AND re-enable proposer invalidation: a proposer invalidates the bad
// checkpoint and the chain rebuilds; every node — validators and the observer — progresses past it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is testing that validators that have already synced (since the blob was deleted potentially after all validators synced it) the checkpoint with the invalid blob can effectively prune it. We should test that a validator that did not sync it can prune it.

To make the test easier, I'd add yet another test-only flag to the sequencer for publishing garbage in its blobs.

logger.info(`Node checkpoint tips: ${tips.join(', ')} (target > ${badCheckpointNumber})`);
return tips.every(n => n > badCheckpointNumber);
},
'chain invalidates the bad checkpoint and every node (incl. the observer) progresses past it',

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not checking that the bad checkpoint was actually invalidated, it's only checking the chain advanced past it. If validators did sync it (eg via promotion) then new blocks would've been produced, though I agree that the observer syncing past it seems like a sign that the chain did heal.

Comment on lines +1138 to +1142
// Resume production: the next proposer prunes the doomed unproven epoch on L1 (prune-on-propose, since
// its proof window has expired) and the chain rebuilds. Every node — validators and the observer that
// skipped the unfetchable checkpoint — must progress past it. This also implicitly asserts the prune
// happened: had the chain instead built on top of the bad checkpoint, the observer (which never
// ingested it) could not ingest any descendant, so it could never get past badCheckpointNumber.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, we should check that proposers that never got to sync to invalid checkpoint can actually prune it.

This also implicitly asserts the prune happened: had the chain instead built on top of the bad checkpoint, the observer (which never ingested it)

Note we are not testing that the observer never ingested it. We should be explicit in these assertions.

@spalladino

Copy link
Copy Markdown
Contributor

Re-scoped replacement opened as #24247 (fresh branch off merge-train/spartan-v5).

Per review discussion, #24247 keeps only the calldata-first attestation validation ("don't download blobs for a checkpoint with wrong attestations") and drops the blob-failure skipping machinery (the rows 4/5 work). A blob fetch/decode failure for an attestation-valid checkpoint stays fatal/retried, as before this PR. The deferred blob-unavailability/liveness handling (plus the test-only garbage-blob sequencer flag and the never-synced-prune e2e) is tracked in A-1260.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-to-v4 ci-draft Run CI on draft PRs. ci-full Run all master checks.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants