Summary
waitForTx and waitForL1ToL2MessageReady abort with a thrown error when a transient RPC failure (e.g. an intermittent 502 Bad Gateway from a public node) occurs during their polling loop — even though the operation actually succeeded and the very next poll would have observed it. The culprit is that retryUntil only retries on a falsy return from its predicate; a thrown error propagates straight out.
Affected: @aztec/aztec.js@4.3.0 (and @aztec/foundation@4.3.0).
Root cause
@aztec/foundation retryUntil:
export async function retryUntil(fn, name = '', timeout = 0, interval = 1) {
  const timer = new Timer();
  while (true) {
    const result = await fn(); // <-- no try/catch
    if (result) return result;
    await sleep(interval * 1000);
    if (timeout && timer.s() > timeout) throw new TimeoutError(...);
  }
}
The predicates passed in by the wait helpers call node reads that can throw on transient RPC failures:
- waitForTx (aztec.js/src/utils/node.ts) → node.getTxReceipt(txHash)
- waitForL1ToL2MessageReady → isL1ToL2MessageReady (aztec.js/src/utils/cross_chain.ts) → node.getL1ToL2MessageCheckpoint(...) and node.getBlock('latest')
When the node returns a 502/503/504 (common with load-balanced public RPCs), the read rejects, the rejection propagates out of retryUntil, and the whole wait fails.
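The propagation described above can be reproduced without a network. The following is a minimal sketch, not the library code: pollUntil is a simplified local stand-in for the quoted loop (no timeout handling), and makeFlakyRead is a hypothetical fake read that reports "not ready", then throws once, then would succeed.

```typescript
// Simplified local stand-in for the polling loop quoted above (no timeout handling).
async function pollUntil<T>(fn: () => Promise<T | undefined>, intervalMs = 1): Promise<T> {
  while (true) {
    const result = await fn(); // a rejection here escapes the loop unhandled
    if (result) return result;
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
  }
}

// Hypothetical fake read: "not ready" on poll 1, a transient failure on poll 2,
// and a successful result from poll 3 onwards.
function makeFlakyRead() {
  let calls = 0;
  const read = async (): Promise<string | undefined> => {
    calls++;
    if (calls === 1) return undefined;
    if (calls === 2) throw new Error('Bad Gateway');
    return 'receipt';
  };
  return { read, callCount: () => calls };
}

// Runs the loop against the fake read and reports how it ended.
async function demo(): Promise<{ outcome: string; polls: number }> {
  const flaky = makeFlakyRead();
  try {
    const value = await pollUntil(flaky.read);
    return { outcome: `resolved: ${value}`, polls: flaky.callCount() };
  } catch (err) {
    return { outcome: `threw: ${(err as Error).message}`, polls: flaky.callCount() };
  }
}
```

In this sketch the loop stops at the second poll with the thrown error and never makes the third poll, which would have returned the result.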
Impact
A single gateway blip during an otherwise-successful transaction fails the wait, making callers believe the tx failed when it was mined. Observed against rpc.testnet.aztec-labs.com: a tx was sent and included in a block, but waitForTx threw Bad Gateway ~90s after Sent transaction. Querying node_getTxReceipt directly afterwards showed executionResult: success / checkpointed. This makes unattended flows (deploys, registration, smoke tests) flaky against any public/HA endpoint, and is easy to mistake for a logic failure.
Note: a reverted tx is reported via a returned receipt (hasExecutionSucceeded() === false), not via a thrown error from getTxReceipt. So a throw from getTxReceipt/getBlock/getL1ToL2MessageCheckpoint is always an infra/transport failure and is safe to retry — retrying does not risk masking a real revert.
Proposed fix
Make the polling predicates resilient to transient throws. Options, roughly in order of preference:
- Catch transient errors inside waitForTx / isL1ToL2MessageReady and treat them as "not ready yet" (return undefined) so retryUntil keeps polling until the existing timeout.
- Add a retryUntil variant (or option) that treats thrown errors as retryable, ideally with a caller-supplied isRetryable(err) predicate and a consecutive-failure cap so a permanently-down node still surfaces.
- Wrap the node client's idempotent read methods with bounded retry-on-transient-error.
Happy to open a PR if there's a preferred shape.
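As a starting point for discussion, here is a rough sketch of what the retryUntil variant from the second option might look like. Everything in it is an assumption rather than existing API: the name retryUntilTolerant, the options object, the default cap of 5, and the local TimeoutError / TooManyFailuresError / sleep stand-ins (the real helper uses Timer and TimeoutError from @aztec/foundation).

```typescript
// Local stand-ins so the sketch is self-contained; not the @aztec/foundation classes.
class TimeoutError extends Error {}
class TooManyFailuresError extends Error {
  lastError: unknown;
  constructor(message: string, lastError: unknown) {
    super(message);
    this.name = 'TooManyFailuresError';
    this.lastError = lastError;
  }
}

// Hypothetical options shape for the proposed variant.
interface TolerantRetryOptions {
  timeoutSec?: number; // 0 means no timeout, mirroring the original
  intervalSec?: number;
  maxConsecutiveFailures?: number; // cap so a permanently-down node still surfaces
  isRetryable?: (err: unknown) => boolean;
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function retryUntilTolerant<T>(
  fn: () => Promise<T | undefined>,
  name = '',
  opts: TolerantRetryOptions = {},
): Promise<T> {
  const { timeoutSec = 0, intervalSec = 1, maxConsecutiveFailures = 5, isRetryable = () => true } = opts;
  const start = Date.now();
  let consecutiveFailures = 0;
  while (true) {
    try {
      const result = await fn();
      consecutiveFailures = 0; // any call that returns resets the failure streak
      if (result) return result;
    } catch (err) {
      if (!isRetryable(err)) throw err; // non-transient errors propagate unchanged
      consecutiveFailures++;
      if (consecutiveFailures >= maxConsecutiveFailures) {
        throw new TooManyFailuresError(`${name}: ${consecutiveFailures} consecutive failures`, err);
      }
    }
    await sleep(intervalSec * 1000);
    if (timeoutSec && (Date.now() - start) / 1000 > timeoutSec) {
      throw new TimeoutError(`Timeout awaiting ${name}`);
    }
  }
}
```

The cap counts consecutive failures only, so isolated gateway errors spread across a long wait do not accumulate, while an endpoint that fails on every poll is reported after maxConsecutiveFailures attempts instead of silently running to the timeout.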
Repro sketch
// Against an endpoint that intermittently 502s (or a proxy that injects a 502
// for one getTxReceipt call), send any tx and await it:
const { txHash } = await wallet.sendTx(payload, { from, fee, wait: 'NO_WAIT' });
await waitForTx(node, txHash); // throws "Bad Gateway" if a poll hits the 502,
// even though the tx is mined.
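For a deterministic repro that does not depend on the endpoint misbehaving, one option is to inject the failure on the client side. The helper below is a hypothetical sketch (injectOneFailure is not part of any Aztec package): it wraps an object in a Proxy so that one chosen call to a named method rejects with a simulated Bad Gateway, while every other call is forwarded. The stubNode object exists only to make the example self-contained and does not reflect the real node interface.

```typescript
// Hypothetical helper: make the Nth call to `method` reject with a simulated 502,
// forwarding all other calls and properties to the wrapped object.
function injectOneFailure<T extends object>(target: T, method: keyof T & string, failOnCall = 1): T {
  let calls = 0;
  return new Proxy(target, {
    get(obj, prop, receiver) {
      const value = Reflect.get(obj, prop, receiver);
      if (prop !== method || typeof value !== 'function') return value;
      return async (...args: unknown[]) => {
        calls++;
        if (calls === failOnCall) throw new Error('Bad Gateway');
        return value.apply(obj, args);
      };
    },
  });
}

// Stub used only for illustration; the receipt shape here is made up.
const stubNode = {
  getTxReceipt: async (txHash: string) => ({ txHash, status: 'success' }),
};

// Returns what the first and second calls on the wrapped stub produce.
async function demoInjection(): Promise<{ first: string; second: string }> {
  const flakyNode = injectOneFailure(stubNode, 'getTxReceipt', 1);
  let first = 'resolved';
  try {
    await flakyNode.getTxReceipt('0xabc');
  } catch (err) {
    first = (err as Error).message;
  }
  const receipt = await flakyNode.getTxReceipt('0xabc');
  return { first, second: receipt.status };
}
```

Assuming the real node client tolerates being wrapped this way, passing injectOneFailure(node, 'getTxReceipt', 3) to waitForTx in place of node should make the third poll fail and trigger the behaviour described in this issue.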
Environment
- @aztec/aztec.js, @aztec/foundation, @aztec/stdlib, @aztec/wallet-sdk: 4.3.0
- Node: v24.12.0
- Endpoint: rpc.testnet.aztec-labs.com (public testnet)