feat(bench): schema v4 + scrape proving-infra and per-role saturation #24260
Open
PhilWindle wants to merge 1 commit into
Phase 1 (A-1221) + Phase 2 (A-1222) of the nightly benchmarking revamp.

Schema v4 (additive over v3, all v3 fields retained):
- New optional sections provingInfra and saturation (metricSeriesMap: open slug -> timeSeries map), plus run.sweepId/sweepLabel so a night's 1/5/10 TPS points group as one sweep.
- Version gate bumped to "4" in bench_output.schema.json and the network_bench_upload check in bootstrap.sh; index.json entry now carries sweepId/sweepLabel. (Dashboard SUPPORTED_RUN_VERSION lives in AztecProtocol/explorations and must be bumped there separately.)
- 10tps-readiness-spec.md: the canonical spec, covering the tx-lifecycle stage waterfall mapped to metrics, headline KPIs + thresholds, and the sweep/run-group notion.

Scraper (bench_scrape.ts):
- provingInfra: prover-node hint-gen (public_processor.* + prover_node block/checkpoint processing, scoped to the prover-node pod) and proving-queue series broken down by aztec_proving_job_type (size/active/job_duration/timed-out/resolved). There is no prover_node.execution.duration metric; the re-execution is the public_processor.* path, mapped accordingly.
- saturation: per-role ELU/CPU/memory as max (hottest pod) AND avg, never a single hand-picked pod. ELU/mem from nodejs.*, CPU from host-metrics.
- Sections scrape independently so one failure cannot drop the others. CPU and ELU may be telemetry-gated in the bench env (empty series, non-fatal); flagged in the spec for live verification per A-1222 acceptance.
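The "sections scrape independently" behaviour could look roughly like the sketch below. This is not the code in bench_scrape.ts: the names scrapeSection and scrapeAll, the TimeSeries point shape, and the errors map are all hypothetical, chosen only to illustrate isolating each section so that a failed query omits that section while an empty (telemetry-gated) series passes through untouched.

```typescript
// Hypothetical shapes; the real types in bench_scrape.ts may differ.
type TimeSeries = { t: number; v: number }[];
type MetricSeriesMap = Record<string, TimeSeries>;
type SectionScraper = () => Promise<MetricSeriesMap>;

interface ScrapeResult {
  sections: Record<string, MetricSeriesMap>; // sections that scraped successfully
  errors: Record<string, string>;            // section name -> failure message
}

// Run one section's scraper and convert a thrown error into a value,
// so a failure here cannot reject the combined scrape.
async function scrapeSection(
  scrape: SectionScraper,
): Promise<{ ok: true; data: MetricSeriesMap } | { ok: false; error: string }> {
  try {
    return { ok: true, data: await scrape() };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// Scrape every section concurrently; failed sections are recorded in
// `errors` and left out of `sections`, the rest are emitted unchanged.
async function scrapeAll(scrapers: Record<string, SectionScraper>): Promise<ScrapeResult> {
  const entries = Object.entries(scrapers);
  const outcomes = await Promise.all(entries.map(([, scrape]) => scrapeSection(scrape)));

  const result: ScrapeResult = { sections: {}, errors: {} };
  entries.forEach(([name], i) => {
    const outcome = outcomes[i];
    if (outcome.ok) {
      result.sections[name] = outcome.data;
    } else {
      result.errors[name] = outcome.error;
    }
  });
  return result;
}
```

In this sketch an empty series is a normal successful result, so a metric that is gated off in the bench env would still appear in the output as an empty array rather than being treated as an error.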
Phase 1 (A-1221) and Phase 2 (A-1222) of the nightly benchmarking revamp. Ignoring the dashboard work for now; this lands the schema + scraper so proving-infra and saturation data start flowing.
Phase 1: schema v4 + spec (A-1221)
- bench_output.schema.json → v4, fully additive (all v3 fields retained; a v3-shaped run re-stamped "4" still validates):
  - New optional provingInfra and saturation sections (metricSeriesMap = open slug→timeSeries map), plus run.sweepId / run.sweepLabel so a night's 1/5/10 TPS points group as one sweep.
  - Version gate bumped to "4" in the schema and in bootstrap.sh's network_bench_upload check; index.json entries now carry sweepId / sweepLabel. The third gate, network-dashboard/data.js SUPPORTED_RUN_VERSION, lives in AztecProtocol/explorations and must be bumped there before v4 runs render (dashboard/Phase 5 work).
- 10tps-readiness-spec.md: canonical spec covering the 9-stage tx-lifecycle waterfall mapped to Prometheus metrics, headline KPIs + pass/fail thresholds, and the sweep/run-group notion.
Phase 2: scraper (A-1222)
bench_scrape.ts now emits, independently of the inclusion scrape:
- provingInfra: prover-node hint-gen (public_processor.* + prover_node block/checkpoint processing, scoped to the prover-node pod) and proving-queue series broken down by aztec_proving_job_type (size / active / job_duration p50·p99 / timed-out · resolved rates).
- saturation: per-role ELU / CPU / memory, each as max (hottest pod) and avg, for validator / rpc / fullNode / proverNode / broker / agent; never a single hand-picked pod.
Notes for review
- There is no aztec.prover_node.execution.duration metric (the spec named it); hint-gen is the public_processor.* re-execution on the prover-node pod, mapped accordingly.
- ELU (nodejs_eventloop_utilization) and especially CPU (process_cpu_utilization, from @opentelemetry/host-metrics) may be telemetry-gated in the bench env; the scraper emits empty series (non-fatal) if absent. Flagged in the spec for live verification (A-1222 acceptance); this is the one item that needs a live bench run to close.
Validation
Schema compiles under ajv-2020; a full v4 payload (with the new sections + sweep fields) and a v3-shaped body re-stamped v4 both validate. Scraper type-strips/loads cleanly.
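For reviewers who want a concrete picture of the per-role saturation output, here is a minimal in-memory sketch of the "max (hottest pod) and avg" reduction. It is an illustration under assumptions, not the scraper's implementation: the PodSample shape and the aggregateByRole name are hypothetical, and the real scraper may well perform this aggregation inside the Prometheus query rather than in TypeScript.

```typescript
// Hypothetical input: one sample per pod per timestamp.
interface PodSample {
  role: string; // e.g. "validator", "rpc", "proverNode"
  pod: string;  // pod name, used only to show that several pods share a role
  t: number;    // timestamp (unix seconds)
  v: number;    // metric value, e.g. ELU in [0, 1]
}

type TimeSeries = { t: number; v: number }[];
type RoleSaturation = { max: TimeSeries; avg: TimeSeries };

// Reduce per-pod samples to two series per role: the hottest pod at each
// timestamp (max) and the mean across that role's pods (avg).
function aggregateByRole(samples: PodSample[]): Record<string, RoleSaturation> {
  // role -> timestamp -> values observed across that role's pods
  const byRole = new Map<string, Map<number, number[]>>();
  for (const s of samples) {
    let byTime = byRole.get(s.role);
    if (byTime === undefined) {
      byTime = new Map<number, number[]>();
      byRole.set(s.role, byTime);
    }
    const values = byTime.get(s.t) ?? [];
    values.push(s.v);
    byTime.set(s.t, values);
  }

  const out: Record<string, RoleSaturation> = {};
  for (const [role, byTime] of byRole) {
    // Sort by timestamp so both series come out in time order.
    const points = [...byTime.entries()].sort(([a], [b]) => a - b);
    out[role] = {
      max: points.map(([t, values]) => ({ t, v: Math.max(...values) })),
      avg: points.map(([t, values]) => ({
        t,
        v: values.reduce((sum, x) => sum + x, 0) / values.length,
      })),
    };
  }
  return out;
}
```

Reporting both series keeps a single overloaded pod visible (max) without letting it stand in for the whole role (avg), which is the point of not hand-picking one pod.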