aztec.js: waitForTx / waitForL1ToL2MessageReady abort on transient RPC errors (retryUntil doesn't retry thrown errors) #23546

Description

@just-mitch

Summary

waitForTx and waitForL1ToL2MessageReady abort with a thrown error when a transient RPC failure (e.g. an intermittent 502 Bad Gateway from a public node) occurs during their polling loop — even though the operation actually succeeded and the very next poll would have observed it. The culprit is that retryUntil only retries on a falsy return from its predicate; a thrown error propagates straight out.

Affected: @aztec/aztec.js@4.3.0 (and @aztec/foundation@4.3.0).

Root cause

@aztec/foundation retryUntil:

export async function retryUntil(fn, name = '', timeout = 0, interval = 1) {
  const timer = new Timer();
  while (true) {
    const result = await fn();   // <-- no try/catch
    if (result) return result;
    await sleep(interval * 1000);
    if (timeout && timer.s() > timeout) throw new TimeoutError(...);
  }
}

The predicates passed in by the wait helpers call node reads that can throw on transient RPC failures:

  • waitForTx (aztec.js/src/utils/node.ts) → node.getTxReceipt(txHash)
  • waitForL1ToL2MessageReady / isL1ToL2MessageReady (aztec.js/src/utils/cross_chain.ts) → node.getL1ToL2MessageCheckpoint(...) and node.getBlock('latest')

When the node returns a 502/503/504 (common with load-balanced public RPCs), the read rejects, the rejection propagates out of retryUntil, and the whole wait fails.
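
To make the failure mode concrete without needing a real node, here is a minimal self-contained sketch. `retryUntilSketch` is a simplified stand-in for the loop quoted above (millisecond arguments, a plain `Error` for the timeout), and `makeFlakyRead` is a hypothetical node read that is pending once, rejects once with "Bad Gateway", and would return a receipt on every later call. None of these names come from the Aztec packages.

```typescript
// Simplified stand-in for the loop quoted above; not the real @aztec/foundation code.
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function retryUntilSketch<T>(
  fn: () => Promise<T | undefined>,
  timeoutMs: number,
  intervalMs: number,
): Promise<T> {
  const start = Date.now();
  while (true) {
    const result = await fn(); // a rejection here escapes the loop immediately
    if (result) return result;
    await sleep(intervalMs);
    if (timeoutMs && Date.now() - start > timeoutMs) throw new Error('Timeout');
  }
}

// Hypothetical node read: pending on poll 1, a 502 on poll 2, a receipt from poll 3 on.
function makeFlakyRead(): { read: () => Promise<string | undefined>; calls: () => number } {
  let n = 0;
  return {
    read: async () => {
      n += 1;
      if (n === 1) return undefined;
      if (n === 2) throw new Error('Bad Gateway');
      return 'receipt';
    },
    calls: () => n,
  };
}

async function demo(): Promise<string> {
  const flaky = makeFlakyRead();
  try {
    await retryUntilSketch(flaky.read, 1000, 1);
    return 'resolved';
  } catch (err) {
    // Reached on poll 2: the receipt that poll 3 would have returned is never seen.
    return `rejected after ${flaky.calls()} polls: ${(err as Error).message}`;
  }
}
```

In this sketch `demo()` resolves to the "rejected after 2 polls" string: the wait aborts on the single bad poll, even though one more poll would have succeeded.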

Impact

A single gateway blip during an otherwise-successful transaction fails the wait, making callers believe the tx failed when it was mined. Observed against rpc.testnet.aztec-labs.com: a tx was sent and included in a block, but waitForTx threw Bad Gateway ~90s after Sent transaction. Querying node_getTxReceipt directly afterwards showed executionResult: success / checkpointed. This makes unattended flows (deploys, registration, smoke tests) flaky against any public/HA endpoint, and is easy to mistake for a logic failure.

Note: a reverted tx is reported via a returned receipt (hasExecutionSucceeded() === false), not via a thrown error from getTxReceipt. So a throw from getTxReceipt/getBlock/getL1ToL2MessageCheckpoint is always an infra/transport failure and is safe to retry — retrying does not risk masking a real revert.
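
A small sketch of that distinction, under stated assumptions: `ReceiptLike` is a hypothetical stand-in for the receipt type (only `hasExecutionSucceeded()` is named in this issue; `isPending()` is assumed for illustration), and `pollOnce` is an invented helper, not an aztec.js function. A rejected read maps to "not ready", while a revert is only ever derived from a receipt that was actually returned.

```typescript
// Hypothetical stand-in for a tx receipt. Only hasExecutionSucceeded() is
// named in this issue; isPending() is an assumption for illustration.
interface ReceiptLike {
  isPending(): boolean;
  hasExecutionSucceeded(): boolean;
}

type PollOutcome = 'not-ready' | 'success' | 'reverted';

// One poll: a thrown/rejected read is treated as "not ready" (transport
// failure), never as a revert. Note this swallows every error, which is only
// reasonable if throws really are always infra failures, as argued above.
async function pollOnce(getReceipt: () => Promise<ReceiptLike>): Promise<PollOutcome> {
  const receipt = await getReceipt().catch(() => undefined);
  if (!receipt || receipt.isPending()) return 'not-ready';
  return receipt.hasExecutionSucceeded() ? 'success' : 'reverted';
}
```

With this shape a revert still surfaces as `'reverted'`, because it arrives as data rather than as an exception.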

Proposed fix

Make the polling predicates resilient to transient throws. Options, roughly in order of preference:

  1. Catch transient errors inside waitForTx / isL1ToL2MessageReady and treat them as "not ready yet" (return undefined) so retryUntil keeps polling until the existing timeout.
  2. Add a retryUntil variant (or option) that treats thrown errors as retryable, ideally with a caller-supplied isRetryable(err) predicate and a consecutive-failure cap so a permanently-down node still surfaces.
  3. Wrap the node client's idempotent read methods with bounded retry-on-transient-error.
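
As a starting point for discussion, option 2 could look roughly like the sketch below. Everything here is hypothetical: `retryUntilTolerant`, `TolerantRetryOptions`, and `isGatewayError` are invented names, the arguments are in milliseconds, the timeout is a plain `Error`, and the classifier simply matches on the error message, which a real implementation would likely replace with a status-code check.

```typescript
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Hypothetical options for a throw-tolerant retry loop.
interface TolerantRetryOptions {
  timeoutMs: number; // 0 disables the overall timeout
  intervalMs: number;
  isRetryable: (err: unknown) => boolean; // caller decides what counts as transient
  maxConsecutiveFailures: number; // cap so a permanently-down node still surfaces
}

async function retryUntilTolerant<T>(
  fn: () => Promise<T | undefined>,
  opts: TolerantRetryOptions,
): Promise<T> {
  const start = Date.now();
  let consecutiveFailures = 0;
  while (true) {
    try {
      const result = await fn();
      consecutiveFailures = 0; // any read that did not throw resets the cap
      if (result) return result;
    } catch (err) {
      consecutiveFailures += 1;
      // Rethrow non-transient errors at once, and transient ones once the cap is hit.
      if (!opts.isRetryable(err) || consecutiveFailures >= opts.maxConsecutiveFailures) {
        throw err;
      }
    }
    await sleep(opts.intervalMs);
    if (opts.timeoutMs && Date.now() - start > opts.timeoutMs) {
      throw new Error('Timeout');
    }
  }
}

// Hypothetical classifier: treats gateway-style failures as transient by message text.
const isGatewayError = (err: unknown): boolean =>
  err instanceof Error &&
  /\b(502|503|504|Bad Gateway|Service Unavailable|Gateway Timeout)\b/i.test(err.message);
```

A wait helper could then pass its existing predicate unchanged, for example `retryUntilTolerant(() => poll(), { timeoutMs, intervalMs, isRetryable: isGatewayError, maxConsecutiveFailures: 5 })`, where `poll` stands for whatever the helper polls today. The cap counts consecutive failures only, so isolated blips spread over a long wait do not add up to an abort.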

Happy to open a PR if there's a preferred shape.

Repro sketch

// Against an endpoint that intermittently 502s (or a proxy that injects a 502
// for one getTxReceipt call), send any tx and await it:
const { txHash } = await wallet.sendTx(payload, { from, fee, wait: 'NO_WAIT' });
await waitForTx(node, txHash); // throws "Bad Gateway" if a poll hits the 502,
                              // even though the tx is mined.

Environment

  • @aztec/aztec.js, @aztec/foundation, @aztec/stdlib, @aztec/wallet-sdk: 4.3.0
  • Node v24.12.0
  • Endpoint: rpc.testnet.aztec-labs.com (public testnet)
