linalg/qr: benchmark heterogeneous & ill-conditioned batches #149
robobryce wants to merge 1 commit
The QR benchmark set ranked only well-conditioned dense batches (cond 1-2), while the ill-conditioned stress structures (rankdef, clustered, band, rowscale, nearcollinear, nearrank) appeared only in the correctness tests. Worse, every batch was conditioning-homogeneous: `generate_input` applied one structure uniformly to all `batch` matrices. Together these let a submission read a few matrices, conclude "the whole batch is well-conditioned," and route the entire batch to a TF32/Cholesky fast path that is only numerically valid for well-conditioned inputs -- winning the ranked (all-dense) cases while the unranked stress cases were the only thing that could have exposed the shortcut. On a realistic batch (per-layer / per-block optimizer factors with varying conditioning, in random positions) such a kernel is either wrong or silently falls back, but the benchmark never built one.

This change makes conditioning robustness part of the score, not just a gate:

- reference.py: add a `mixed` case that assigns each matrix in the batch an independent conditioning profile (well-conditioned dense majority interleaved with the ill-conditioned structures) at a random, seeded position. The per-case logic is factored into `_apply_case` and reused; existing homogeneous cases produce bit-for-bit identical data (verified on CPU), so prior leaderboard results are unaffected.
- task.yml: add `mixed` cases to the tests AND the benchmarks, plus fully ill-conditioned homogeneous batches (rankdef/clustered/nearrank) at the dominant benchmark shapes, so the runtime cost of the accurate path on hard inputs is ranked too.

The reference `torch.geqrf` passes the checker on all new cases with wide margin (scaled factor residual 0.002-0.015 vs gate 20; orthogonality 0.17-0.32 vs gate 100), so the problem stays well-posed. New benchmark cases reuse existing shapes, so the memory/timeout envelope is unchanged.

Not yet validated on B200 (authored while the target GPU was busy); needs a benchmark/leaderboard run to confirm timings and timeouts.

Co-Authored-By: Claude <noreply@anthropic.com>
Yeah, I think you should turn recheck mode on for benchmark mode in a separate PR. Please measure the wall-clock time of running Benchmark and Leaderboard before and after this change.
Addressed both points, and validated on the B200. 1. 2. Wall-clock before/after. Measured on the B200 with the
The added cost is the 5 new cases, each of which exercises the accurate (slower-than-TF32) path that the new ranking is meant to reward; per-case timings for the 7 pre-existing cases are unchanged within run-to-run noise (e.g. the dominant n=512/b=640 shape: 1075.2M ns before vs 1076.5M ns after). All runs reported Validation. Marking ready for review.
Summary
The `linalg/qr_py` benchmark ranks submissions only on well-conditioned dense batches, while every ill-conditioned input lives in the correctness tests, and every generated batch is conditioning-homogeneous. That combination rewards a kernel that detects "this batch is well-conditioned" and routes the whole batch to a fast path (TF32 / CholeskyQR) that is only numerically valid for well-conditioned inputs. This PR makes conditioning robustness part of the score, not just a gate.

The problem
Looking at the current `task.yml`:

- `benchmarks:` (what determines ranking) are all `case: dense`, `cond` 1–2: uniformly well-conditioned.
- `tests:` (correctness only) are where `rankdef`, `clustered`, `band`, `rowscale`, `nearcollinear`, `nearrank` live.

And in `reference.py`, `generate_input` applies each `case` to the entire batch by broadcasting (`a * scales`, `a * _band_mask(...)`, `a[:, :, rank:] = 0`, …), so all `batch` matrices in a call share one conditioning structure.

A submission can therefore:

Such a kernel wins the ranked cases, and the only inputs that could expose it (the stress cases) are never benchmarked and, being homogeneous, never even mix a hidden ill-conditioned matrix into an otherwise-dense batch. This doesn't match the stated workload either: the description targets optimizer-style statistics (`G @ G.T`, per-layer/per-block factors), where the matrices batched into one call have widely varying conditioning at arbitrary positions, not one shared structure.

(Observed in practice: an autocuda optimization run on this problem converged on exactly this batch-level conditioning probe: sample the first few matrices, route the batch. That is the rational response to the benchmark as written. The kernel isn't cheating; the benchmark under-specifies the distribution.)
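The batch-level probe can be made concrete with a toy model. This is a minimal sketch, not the kernel from the autocuda run: each matrix is reduced to a hypothetical condition-number estimate, and `probe_and_route`, `misrouted`, `SAMPLE_SIZE`, and `COND_THRESHOLD` are illustrative names and values that do not appear in the benchmark.

```python
import random

SAMPLE_SIZE = 4        # hypothetical: how many leading matrices the probe inspects
COND_THRESHOLD = 1e3   # hypothetical: cutoff above which the fast path is unsafe


def probe_and_route(conds):
    """Batch-level probe: inspect only the first few matrices, route everything."""
    sample = conds[:SAMPLE_SIZE]
    return "fast" if all(c < COND_THRESHOLD for c in sample) else "accurate"


def misrouted(conds):
    """Count ill-conditioned matrices that the chosen route sends to the fast path."""
    if probe_and_route(conds) != "fast":
        return 0
    return sum(c >= COND_THRESHOLD for c in conds)


rng = random.Random(0)

# Homogeneous batch, like the ranked dense cases: every matrix has cond in [1, 2].
homogeneous = [rng.uniform(1.0, 2.0) for _ in range(16)]

# Heterogeneous batch: same dense majority, with ill-conditioned matrices
# hidden at positions the probe never looks at.
heterogeneous = list(homogeneous)
for pos in (5, 9, 12):
    heterogeneous[pos] = 1e8

print(probe_and_route(homogeneous), misrouted(homogeneous))      # fast 0
print(probe_and_route(heterogeneous), misrouted(heterogeneous))  # fast 3
```

On the homogeneous batch the probe's conclusion happens to hold for every matrix; on the heterogeneous batch the same probe sends three ill-conditioned matrices down the fast path.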
The fix
`reference.py`: add a `mixed` case. Each matrix in the batch is independently assigned a conditioning profile (a well-conditioned dense majority, ~50%, interleaved with the existing ill-conditioned structures) at a random, seeded position. The per-case logic is factored into a shared `_apply_case` helper; the `mixed` path scatters per-profile sub-batches by a `torch.multinomial` label vector, with a guarantee that tiny batches still contain both a well- and an ill-conditioned matrix. Determinism is preserved (seeded generator), and existing homogeneous cases produce bit-for-bit identical data (the base `randn` is still drawn first, then the same case extras in the same order; verified on CPU), so prior leaderboard results for the existing cases are unaffected.

`task.yml`: make conditioning robustness ranked:

- Add `mixed` to both `tests:` and `benchmarks:`.
- Add fully ill-conditioned homogeneous batches (`rankdef`, `clustered`, `nearrank`) to `benchmarks:` at the dominant shapes, so the runtime cost of the accurate path on hard inputs is part of the geomean.
- `mixed` and the intent.

With a heterogeneous batch present, a "sample-a-few-and-route-the-batch" kernel is either wrong (fast-paths a hidden ill-conditioned matrix → fails the checker, which runs unconditionally on the warm-up output) or conservatively slow (routes the whole batch to the exact path). Either way the shortcut no longer yields a free ranking win.
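The per-matrix assignment can be sketched without PyTorch. The following is a simplified stand-in, assuming a seeded `random.Random` in place of the `torch.multinomial` draw described above; `assign_profiles`, `ILL_PROFILES`, and the uniform split among the ill-conditioned profiles are hypothetical, and only the case names come from the PR.

```python
import random

WELL = "dense"
# Case names taken from the PR; choosing among them uniformly is an assumption.
ILL_PROFILES = ["rankdef", "clustered", "band", "rowscale", "nearcollinear", "nearrank"]


def assign_profiles(batch, seed, dense_fraction=0.5):
    """Give each matrix in the batch its own conditioning profile label."""
    rng = random.Random(seed)  # seeded, so the same seed yields the same labels
    labels = [
        WELL if rng.random() < dense_fraction else rng.choice(ILL_PROFILES)
        for _ in range(batch)
    ]
    if batch >= 2:
        # Guarantee both kinds are present, even in tiny batches.
        if all(lab == WELL for lab in labels):
            labels[rng.randrange(batch)] = rng.choice(ILL_PROFILES)
        elif all(lab != WELL for lab in labels):
            labels[rng.randrange(batch)] = WELL
    return labels


labels = assign_profiles(batch=8, seed=1234)
print(labels)
```

A real generator would then apply each profile to the sub-batch of matrices carrying that label, which is the role the PR gives to `_apply_case`.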
Validation done
I could not run on a B200 (the target GPU was busy with another job), so this was validated on CPU against an unmodified copy of upstream `reference.py`:

- The existing homogeneous cases and the `dense` path produce `torch.equal` data before/after the refactor.
- `mixed` is correct: deterministic per seed, distinct across seeds, finite, correct shape; ~60–70% well- / ~30–40% ill-conditioned by the same per-column-spread + sparsity metric a routing kernel would use, with ill-conditioned matrices at random interleaved positions (e.g. 2, 4, 5, 7, 9, …) rather than clustered at the front.
- The reference `torch.geqrf` passes `check_implementation` on every new case with wide margin (scaled factor residual 0.002–0.015 vs gate 20; orthogonality 0.17–0.32 vs gate 100).
- `eval.py`'s regex; seeds are unique; new benchmark cases reuse existing shapes so the per-case input size (671 MB at b640/n512, 252 MB at b60/n1024) and `_benchmark_batch_count` (1 copy) match cases already benchmarked → memory/timeout envelope unchanged.

Still needs a maintainer B200 run to confirm timings and that the new benchmark cases stay within `benchmark_timeout`/`ranked_timeout` (the accurate path on the ill-conditioned cases is slower than TF32 by design; that's the cost we now want ranked). Opening as a draft for that reason.

Out of scope (noted for a follow-up)
There's an orthogonal hardening opportunity in `eval.py`: `benchmark` mode times with `recheck=False`, and the dominant shapes feed the timed loop a single reused input object, so an output cache keyed on that input would be invisible locally. That's a different attack (output replay, not conditioning routing) and a separate change; happy to follow up if you'd like it in the same PR.

🤖 Generated with Claude Code
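To illustrate why a reused input object matters for the output-replay concern, here is a toy sketch. It is not code from `eval.py`; `CachingKernel`, `slow_qr`, and the loops are hypothetical stand-ins, and keying the cache on `id` of the input object is just one assumed way such a cache could work.

```python
class CachingKernel:
    """Toy 'submission' that replays its previous output for a repeated input object."""

    def __init__(self, compute):
        self.compute = compute
        self.cache = {}      # maps id(input) -> (input, output)
        self.real_calls = 0  # how many times the real work actually ran

    def __call__(self, data):
        key = id(data)
        hit = self.cache.get(key)
        if hit is not None and hit[0] is data:
            return hit[1]  # replay: no computation performed
        self.real_calls += 1
        out = self.compute(data)
        self.cache[key] = (data, out)  # holding `data` keeps its id from being reused
        return out


def slow_qr(data):
    """Placeholder for the expensive factorization; here it just copies the input."""
    return list(data)


kernel = CachingKernel(slow_qr)

# Timed loop that feeds the same input object every iteration.
shared_input = [1.0, 2.0, 3.0]
for _ in range(100):
    kernel(shared_input)
print(kernel.real_calls)  # 1: only the first iteration did real work

# Feeding a fresh input object per iteration defeats the replay.
fresh_kernel = CachingKernel(slow_qr)
for _ in range(100):
    fresh_kernel([1.0, 2.0, 3.0])
print(fresh_kernel.real_calls)  # 100
```

In this toy setting the replay is only visible when the loop either supplies fresh inputs or re-verifies outputs, which is the gap the follow-up would close.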