feat(bench): schema v4 + scrape proving-infra and per-role saturation #24260

Open
PhilWindle wants to merge 1 commit into merge-train/spartan-v5 from phil/bench-schema-v4-scraper

@PhilWindle

Collaborator

Phase 1 (A-1221) and Phase 2 (A-1222) of the nightly benchmarking revamp. Dashboard work is out of scope for now; this PR lands the schema and scraper so that proving-infra and saturation data start flowing.

Phase 1 — schema v4 + spec (A-1221)

  • bench_output.schema.json → v4, fully additive (all v3 fields retained; a v3-shaped run re-stamped "4" still validates):
    • new optional provingInfra and saturation sections (metricSeriesMap = open slug→timeSeries map),
    • run.sweepId / run.sweepLabel so a night's 1/5/10 TPS points group as one sweep.
  • Version gate bumped to "4" in the schema and in bootstrap.sh's network_bench_upload check; index.json entries now carry sweepId/sweepLabel. The third gate — network-dashboard/data.js SUPPORTED_RUN_VERSION — lives in AztecProtocol/explorations and must be bumped there before v4 runs render (dashboard/Phase 5 work).
  • 10tps-readiness-spec.md: canonical spec — the 9-stage tx-lifecycle waterfall mapped to Prometheus metrics, headline KPIs + pass/fail thresholds, and the sweep/run-group notion.
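As a rough illustration of the additive v4 shape and the sweep grouping described above, here is a minimal TypeScript sketch. Only the names provingInfra, saturation, metricSeriesMap, sweepId, sweepLabel and SUPPORTED_RUN_VERSION come from this PR; the point layout, the field named version, the helper names and the "(no sweep)" fallback key are assumptions made for the example, not the actual schema or dashboard code.

```typescript
// Assumed point layout; the real timeSeries definition lives in bench_output.schema.json.
type TimeSeriesPoint = { t: number; v: number };
type TimeSeries = TimeSeriesPoint[];

// "Open slug -> timeSeries map": any metric slug may appear as a key.
type MetricSeriesMap = Record<string, TimeSeries>;

// Hypothetical run metadata; only sweepId/sweepLabel are named in the PR.
interface RunMeta {
  version: string;
  sweepId?: string;
  sweepLabel?: string;
}

// New v4 sections are optional, so a v3-shaped body re-stamped "4" still fits.
interface BenchOutputV4 {
  run: RunMeta;
  provingInfra?: MetricSeriesMap;
  saturation?: MetricSeriesMap;
}

const SUPPORTED_RUN_VERSION = "4";

// Version gate: accept only runs stamped with the supported version.
function isSupportedRun(run: RunMeta): boolean {
  return run.version === SUPPORTED_RUN_VERSION;
}

// Group index entries so one night's 1/5/10 TPS points land in one sweep.
// Entries without a sweepId share a single fallback bucket (an assumption).
function groupBySweep<T extends { sweepId?: string }>(entries: T[]): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  for (const entry of entries) {
    const key = entry.sweepId ?? "(no sweep)";
    const bucket = groups.get(key) ?? [];
    bucket.push(entry);
    groups.set(key, bucket);
  }
  return groups;
}
```

With three entries sharing one sweepId, groupBySweep returns a single group of three, which is the grouping a dashboard would need to draw one sweep per night.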

Phase 2 — scraper (A-1222)

bench_scrape.ts now emits, independently of the inclusion scrape:

  • provingInfra: prover-node hint-gen (public_processor.* + prover_node block/checkpoint processing, scoped to the prover-node pod) and proving-queue series broken down by aztec_proving_job_type (size / active / job_duration p50·p99 / timed-out · resolved rates).
  • saturation: per-role ELU / CPU / memory, each as max (hottest pod) and avg, for validator / rpc / fullNode / proverNode / broker / agent — never a single hand-picked pod.
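The max/avg-per-role idea can be sketched as follows. This is an illustration only: the scraper may well do this aggregation in the Prometheus query itself, and the PodSeries type and aggregateRole helper are hypothetical names introduced here. The sketch assumes every pod of a role is sampled at the same timestamps.

```typescript
type Point = { t: number; v: number };

// One series per pod of a given role (hypothetical shape).
type PodSeries = { pod: string; points: Point[] };

// Collapse all pods of a role into two series:
// max = the hottest pod at each timestamp, avg = the mean across pods.
function aggregateRole(pods: PodSeries[]): { max: Point[]; avg: Point[] } {
  // Collect every pod's value per timestamp.
  const byTime = new Map<number, number[]>();
  for (const { points } of pods) {
    for (const { t, v } of points) {
      const values = byTime.get(t) ?? [];
      values.push(v);
      byTime.set(t, values);
    }
  }

  const max: Point[] = [];
  const avg: Point[] = [];
  const times = [...byTime.keys()].sort((a, b) => a - b);
  for (const t of times) {
    const values = byTime.get(t) ?? [];
    if (values.length === 0) continue;
    max.push({ t, v: Math.max(...values) });
    avg.push({ t, v: values.reduce((sum, x) => sum + x, 0) / values.length });
  }
  return { max, avg };
}
```

Reporting both series avoids the failure mode the PR calls out: a single hand-picked pod can look healthy while another pod of the same role is saturated.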

Notes for review

  • There is no aztec.prover_node.execution.duration metric (the spec named it); hint-gen is the public_processor.* re-execution on the prover-node pod, mapped accordingly.
  • ELU (nodejs_eventloop_utilization) and especially CPU (process_cpu_utilization, from @opentelemetry/host-metrics) may be telemetry-gated in the bench env — the scraper emits empty series (non-fatal) if absent. Flagged in the spec for live verification (A-1222 acceptance); this is the one item that needs a live bench run to close.
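The non-fatal behaviour described above could look roughly like the sketch below. It is not the code in bench_scrape.ts: QueryFn, scrapeSeries and scrapeSections are hypothetical names, and treating a failed query the same as an absent metric is an assumption made for the example.

```typescript
type Point = { t: number; v: number };
type MetricSeriesMap = Record<string, Point[]>;

// Hypothetical query function: resolves to [] when the metric is absent.
type QueryFn = (query: string) => Promise<Point[]>;

// Scrape each slug on its own; an absent or failing metric yields an
// empty series instead of aborting the whole section.
async function scrapeSeries(
  query: QueryFn,
  queries: Record<string, string>,
): Promise<MetricSeriesMap> {
  const out: MetricSeriesMap = {};
  for (const [slug, q] of Object.entries(queries)) {
    try {
      out[slug] = await query(q);
    } catch {
      out[slug] = []; // assumption: a query error is treated like an absent metric
    }
  }
  return out;
}

// Run sections independently: a rejected section is omitted from the
// output, and the remaining sections are still emitted.
async function scrapeSections(
  sections: Record<string, () => Promise<MetricSeriesMap>>,
): Promise<Record<string, MetricSeriesMap>> {
  const entries = Object.entries(sections);
  const settled = await Promise.allSettled(
    entries.map(([, scrape]) => Promise.resolve().then(scrape)),
  );
  const result: Record<string, MetricSeriesMap> = {};
  settled.forEach((outcome, i) => {
    const name = entries[i]?.[0];
    if (name !== undefined && outcome.status === "fulfilled") {
      result[name] = outcome.value;
    }
  });
  return result;
}
```

Because both new sections are optional in schema v4, omitting a failed section still leaves a payload that fits the schema shape.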

Validation

Schema compiles under ajv-2020; a full v4 payload (with the new sections + sweep fields) and a v3-shaped body re-stamped v4 both validate. Scraper type-strips/loads cleanly.

Phase 1 (A-1221) + Phase 2 (A-1222) of the nightly benchmarking revamp.

Schema v4 (additive over v3, all v3 fields retained):
- New optional sections provingInfra and saturation (metricSeriesMap: open
  slug -> timeSeries map), plus run.sweepId/sweepLabel so a night's 1/5/10 TPS
  points group as one sweep.
- Version gate bumped to "4" in bench_output.schema.json and the
  network_bench_upload check in bootstrap.sh; index.json entry now carries
  sweepId/sweepLabel. (Dashboard SUPPORTED_RUN_VERSION lives in
  AztecProtocol/explorations and must be bumped there separately.)
- 10tps-readiness-spec.md: the canonical spec — tx-lifecycle stage waterfall
  mapped to metrics, headline KPIs + thresholds, sweep/run-group notion.

Scraper (bench_scrape.ts):
- provingInfra: prover-node hint-gen (public_processor.* + prover_node
  block/checkpoint processing, scoped to the prover-node pod) and proving-queue
  series broken down by aztec_proving_job_type (size/active/job_duration/
  timed-out/resolved). There is no prover_node.execution.duration metric; the
  re-execution is the public_processor.* path, mapped accordingly.
- saturation: per-role ELU/CPU/memory as max (hottest pod) AND avg, never a
  single hand-picked pod. ELU/mem from nodejs.*, CPU from host-metrics.
- Sections scrape independently so one failure cannot drop the others.

CPU and ELU may be telemetry-gated in the bench env (empty series, non-fatal);
flagged in the spec for live verification per A-1222 acceptance.