aztec.js: waitForTx / waitForL1ToL2MessageReady abort on transient RPC errors (retryUntil doesn't retry thrown errors)

## Summary

`waitForTx` and `waitForL1ToL2MessageReady` abort with a thrown error when a **transient** RPC failure (e.g. an intermittent `502 Bad Gateway` from a public node) occurs during their polling loop — even though the operation actually succeeded and the very next poll would have observed it. The culprit is that `retryUntil` only retries on a falsy *return* from its predicate; a *thrown* error propagates straight out.

Affected: `@aztec/aztec.js@4.3.0` (and `@aztec/foundation@4.3.0`).

## Root cause

`retryUntil` in `@aztec/foundation` (abridged):

```js
export async function retryUntil(fn, name = '', timeout = 0, interval = 1) {
  const timer = new Timer();
  while (true) {
    const result = await fn();   // <-- no try/catch
    if (result) return result;
    await sleep(interval * 1000);
    if (timeout && timer.s() > timeout) throw new TimeoutError(...);
  }
}
```

The predicates passed in by the wait helpers call node reads that can throw on transient RPC failures:

- `waitForTx` (`aztec.js/src/utils/node.ts`) → `node.getTxReceipt(txHash)`
- `waitForL1ToL2MessageReady` → `isL1ToL2MessageReady` (`aztec.js/src/utils/cross_chain.ts`) → `node.getL1ToL2MessageCheckpoint(...)` and `node.getBlock('latest')`

When the node returns a 502/503/504 (common with load-balanced public RPCs), the read rejects, the rejection propagates out of `retryUntil`, and the whole wait fails.

## Impact

A single gateway blip during an otherwise successful transaction fails the wait, so callers believe the tx failed when it was in fact mined. Observed against `rpc.testnet.aztec-labs.com`: a tx was sent and included in a block, but `waitForTx` threw `Bad Gateway` ~90s after `Sent transaction`. Querying `node_getTxReceipt` directly afterwards showed `executionResult: success` / `checkpointed`. This makes unattended flows (deploys, registration, smoke tests) flaky against any public/HA endpoint, and the failure is easy to mistake for a logic error.

Note: a *reverted* tx is reported via a returned receipt (`hasExecutionSucceeded() === false`), **not** via a thrown error from `getTxReceipt`. So a throw from `getTxReceipt`/`getBlock`/`getL1ToL2MessageCheckpoint` is always an infra/transport failure and is safe to retry — retrying does not risk masking a real revert.

## Proposed fix

Make the polling predicates resilient to transient throws. Options, roughly in order of preference:

1. Catch transient errors inside `waitForTx` / `isL1ToL2MessageReady` and treat them as "not ready yet" (return `undefined`) so `retryUntil` keeps polling until the existing timeout.
2. Add a `retryUntil` variant (or option) that treats thrown errors as retryable, ideally with a caller-supplied `isRetryable(err)` predicate and a consecutive-failure cap so a permanently-down node still surfaces.
3. Wrap the node client's idempotent read methods with bounded retry-on-transient-error.
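
To make option 2 concrete, here is a minimal sketch of what such a variant could look like. It is not the `@aztec/foundation` API: the name `retryUntilTolerant`, the `RetryOpts` shape, the default cap of 5, and the use of a plain `Error` for the timeout are all assumptions made for illustration.

```ts
// Hypothetical variant of retryUntil; names, options, and defaults are illustrative only.
type RetryOpts = {
  timeoutSec?: number; // 0 means no timeout, as in the existing helper
  intervalSec?: number;
  isRetryable?: (err: unknown) => boolean; // caller decides which throws are transient
  maxConsecutiveFailures?: number; // lets a permanently-down node still surface
};

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function retryUntilTolerant<T>(
  fn: () => Promise<T | undefined>,
  name = '',
  opts: RetryOpts = {},
): Promise<T> {
  const { timeoutSec = 0, intervalSec = 1, isRetryable = () => true, maxConsecutiveFailures = 5 } = opts;
  const start = Date.now();
  let failures = 0;
  while (true) {
    try {
      const result = await fn();
      if (result) return result;
      failures = 0; // the node answered, so the failure streak is over
    } catch (err) {
      failures++;
      // Rethrow non-transient errors immediately, and transient ones once the cap is hit.
      if (!isRetryable(err) || failures >= maxConsecutiveFailures) throw err;
    }
    await sleep(intervalSec * 1000);
    if (timeoutSec && (Date.now() - start) / 1000 > timeoutSec) {
      // Stand-in for the library's TimeoutError.
      throw new Error(`Timeout awaiting ${name}`);
    }
  }
}
```

A caller could pass an `isRetryable` that matches gateway errors, for example by testing the error message against `Bad Gateway`. How the node client actually surfaces 502/503/504 responses is an assumption here and would need checking against the real client.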

Happy to open a PR if there's a preferred shape.

## Repro sketch

```ts
// Against an endpoint that intermittently 502s (or a proxy that injects a 502
// for one getTxReceipt call), send any tx and await it:
const { txHash } = await wallet.sendTx(payload, { from, fee, wait: 'NO_WAIT' });
await waitForTx(node, txHash); // throws "Bad Gateway" if a poll hits the 502,
                              // even though the tx is mined.
```
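
The same behaviour can be reproduced without a network. The sketch below uses a simplified stand-in for `retryUntil` (timeout omitted) and a hypothetical fake node whose `getTxReceipt` returns `undefined` while pending, throws once, and then returns a receipt. The real `getTxReceipt` and the `waitForTx` predicate are more involved, and every name here is illustrative. `demoWithCatch` shows option 1 from the proposed fix in its crudest form, swallowing every throw rather than only transient ones.

```ts
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Simplified stand-in for retryUntil as quoted under "Root cause" (no timeout handling).
async function retryUntilLike<T>(fn: () => Promise<T | undefined>, intervalSec = 0.01): Promise<T> {
  while (true) {
    const result = await fn(); // a rejection here escapes the loop
    if (result) return result;
    await sleep(intervalSec * 1000);
  }
}

// Hypothetical fake node: pending on poll 1, 502 on poll 2, mined from poll 3 onwards.
function makeFlakyNode() {
  let polls = 0;
  return {
    get polls() {
      return polls;
    },
    async getTxReceipt(_txHash: string): Promise<{ status: string } | undefined> {
      polls++;
      if (polls === 2) throw new Error('Bad Gateway');
      return polls >= 3 ? { status: 'success' } : undefined;
    },
  };
}

// Current behaviour: the single transient throw aborts the wait.
async function demo(): Promise<string> {
  const node = makeFlakyNode();
  try {
    await retryUntilLike(() => node.getTxReceipt('0xabc'));
    return 'resolved';
  } catch (err) {
    return `threw "${(err as Error).message}" on poll ${node.polls}`;
  }
}

// Option 1: the predicate treats a throw as "not ready yet" and polling continues.
async function demoWithCatch(): Promise<string> {
  const node = makeFlakyNode();
  const receipt = await retryUntilLike(async () => {
    try {
      return await node.getTxReceipt('0xabc');
    } catch {
      return undefined;
    }
  });
  return `${receipt.status} on poll ${node.polls}`;
}
```

With this fake node, `demo()` resolves to `threw "Bad Gateway" on poll 2`, while `demoWithCatch()` resolves to `success on poll 3`.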

## Environment

- `@aztec/aztec.js`, `@aztec/foundation`, `@aztec/stdlib`, `@aztec/wallet-sdk`: `4.3.0`
- Node `v24.12.0`
- Endpoint: `rpc.testnet.aztec-labs.com` (public testnet)

