fix(ci): give AVM check-circuit more CPU/time for heavy txs (canonical) #24234
Draft
AztecBot wants to merge 1 commit into
What
Raise the per-command resource budget for `avm_check_circuit` in `yarn-project/end-to-end/bootstrap.sh` from `CPUS=2` (default) / `TIMEOUT=30s` to `CPUS=4` / `TIMEOUT=120s`.

Why
The "AVM Circuit Inputs Collection and Check" workflow (`avm-check-circuit` job) has been failing repeatedly on `next`. The GitHub job exits `124`, which propagates up from a single check-circuit invocation that blows the per-tx timeout while every other input passes in 4–11s. This is a clean wall-clock timeout (exit 124), not a circuit/assertion error and not OOM (peak ~3.9 GiB inside an 8 GiB container; `dmesg` showed no kill).

`avm_check_circuit_cmds` fans out up to 96 `bb-avm avm_check_circuit` jobs in parallel via `parallelize`. The heaviest e2e txs build large AVM circuits (~700k rows); trace generation plus check-circuit for them does not fit in the long-standing 30s/2-CPU budget (in place since #18747), especially under that parallel CPU contention. The existing code comment already anticipated exactly this ("transactions could need more CPU and MEM than we allocate by default … they might start timing out"). These failures are unrelated to whatever commit happens to be at the head of the failing run: `avm_check_circuit` is standalone `bb-avm` reading a dumped `.bin`.

Observed timeouts span multiple heavy txs, confirming this is not a single-tx fluke:
- `e2e_multiple_blobs` tx `0x242e67c9…`: 34–35s (run 27860137759, and prior run 27805430170)
- `e2e_multiple_blobs` tx `0x23d0ab72…`: 33s (run 28024173324)
- `e2e_block_building` tx `0x0737b400…`: 39s (run 28034181648)

Fix
Both trace generation and check-circuit are multithreaded, so the bottleneck is the 2-CPU cap as much as the 30s clock. Bump to `CPUS=4` (which also raises the derived `MEM` to 16 GiB via `MEM=CPUS*4 g`) and `TIMEOUT=120s` for generous headroom on the heaviest txs and on a loaded runner. One-line prefix update plus a refreshed comment. No code/circuit behavior changes; this only adjusts CI execution resources.

Validation note
`avm-circuit-inputs.yml` triggers only on push to `next`, the nightly cron, and `workflow_dispatch`; it does not run on pull requests. So this change cannot be validated by PR CI; it takes effect (and is validated by the next push/nightly run) once merged to `next`, or via a manual `workflow_dispatch` on this branch.

Supersedes
This is the canonical consolidation of the duplicate timeout-bump PRs opened by successive failure auto-dispatches. It strictly dominates them (more CPU and a larger global timeout, covering every heavy tx, not just `e2e_multiple_blobs`):

- (`e2e_multiple_blobs` only; would not cover the `e2e_block_building` timeout)

Closing those four in favor of this PR.
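
For reviewers who want to sanity-check the two mechanics this description leans on, here is a minimal standalone sketch. It is not the code in `bootstrap.sh`: the helper `derive_mem` is hypothetical, the variable names only mirror the `CPUS` / `MEM` / `TIMEOUT` names used above, and `sleep` stands in for a slow check-circuit invocation. It assumes GNU coreutils `timeout` is available, which exits with status 124 when the time limit is reached.

```shell
#!/usr/bin/env bash
# Sketch only: names mirror the description (CPUS, MEM, TIMEOUT);
# the real prefix in bootstrap.sh may be spelled differently.

# Hypothetical helper: derive the memory budget as MEM = CPUS * 4 GiB.
derive_mem() {
  local cpus=$1
  echo "$((cpus * 4))g"
}

old_mem=$(derive_mem 2)   # budget under the previous 2-CPU default
new_mem=$(derive_mem 4)   # budget after bumping to 4 CPUs
echo "MEM: ${old_mem} -> ${new_mem}"

# GNU `timeout` kills the command once the limit is hit and exits 124.
# `sleep 3` is a stand-in for a check-circuit run that overruns its budget.
timeout_status=0
timeout 1 sleep 3 || timeout_status=$?
echo "timeout exit status: ${timeout_status}"
```

Under those assumptions the script reports a memory budget going from `8g` to `16g` and an exit status of 124, matching the 8 GiB container and the exit code seen in the failing runs.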