linalg/qr: benchmark heterogeneous & ill-conditioned batches by robobryce · Pull Request #149 · gpu-mode/reference-kernels

robobryce · 2026-06-13T20:34:35Z

Summary

The linalg/qr_py benchmark ranks submissions only on well-conditioned dense batches, while every ill-conditioned input lives in the correctness tests — and every generated batch is conditioning-homogeneous. That combination rewards a kernel that detects "this batch is well-conditioned" and routes the whole batch to a fast path (TF32 / CholeskyQR) that is only numerically valid for well-conditioned inputs. This PR makes conditioning robustness part of the score, not just a gate.

The problem

Looking at the current task.yml:

benchmarks: (what determines ranking) are all case: dense, cond 1–2 — uniformly well-conditioned.
tests: (correctness only) are where rankdef, clustered, band, rowscale, nearcollinear, nearrank live.

And in reference.py, generate_input applies each case to the entire batch by broadcasting (a * scales, a * _band_mask(...), a[:, :, rank:] = 0, …) — so all batch matrices in a call share one conditioning structure.

A submission can therefore:

1. Sample a few matrices from the batch,
2. Observe they're well-conditioned (true for every ranked case and, because batches are homogeneous, representative of the whole batch),
3. Route the whole batch to a fast path that doesn't hold the QR tolerance on ill-conditioned inputs.

Such a kernel wins the ranked cases, and the only inputs that could expose it (the stress cases) are never benchmarked — and, being homogeneous, never even mix a hidden ill-conditioned matrix into an otherwise-dense batch. This doesn't match the stated workload either: the description targets optimizer-style statistics (G @ G.T, per-layer/per-block factors), where the matrices batched into one call have widely varying conditioning at arbitrary positions, not one shared structure.

(Observed in practice: an autocuda optimization run on this problem converged on exactly this batch-level conditioning probe — sample the first few matrices, route the batch — which is the rational response to the benchmark as written. The kernel isn't cheating; the benchmark under-specifies the distribution.)

The fix

reference.py — add a mixed case: each matrix in the batch is independently assigned a conditioning profile (a well-conditioned dense majority, ~50%, interleaved with the existing ill-conditioned structures) at a random, seeded position. The per-case logic is factored into a shared _apply_case helper; the mixed path scatters per-profile sub-batches by a torch.multinomial label vector, with a guarantee that tiny batches still contain both a well- and an ill-conditioned matrix. Determinism is preserved (seeded generator), and existing homogeneous cases produce bit-for-bit identical data (the base randn is still drawn first, then the same case extras in the same order — verified on CPU), so prior leaderboard results for the existing cases are unaffected.
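The per-matrix label assignment described above can be sketched in plain Python. This is only an illustration of the idea: the PR draws labels with torch.multinomial and a seeded torch generator, whereas this sketch uses random.Random, and the function name assign_profiles, the reduced profile list, and the even split of the non-dense weight are all assumptions.

```python
import random

# Hypothetical subset of the ill-conditioned profiles named in the PR.
ILL_PROFILES = ["rankdef", "clustered", "nearrank"]


def assign_profiles(batch, seed, dense_weight=0.5):
    """Draw one conditioning profile per matrix, reproducibly for a given seed."""
    if batch < 2:
        raise ValueError("need at least two matrices to mix profiles")
    rng = random.Random(seed)
    profiles = ["dense"] + ILL_PROFILES
    ill_weight = (1.0 - dense_weight) / len(ILL_PROFILES)
    weights = [dense_weight] + [ill_weight] * len(ILL_PROFILES)
    labels = rng.choices(profiles, weights=weights, k=batch)

    # Guarantee for tiny batches: if the draw came out all-dense or all-ill,
    # force one well-conditioned and one ill-conditioned matrix at two
    # distinct seeded positions.
    all_dense = all(label == "dense" for label in labels)
    if all_dense or "dense" not in labels:
        pos_dense, pos_ill = rng.sample(range(batch), 2)
        labels[pos_dense] = "dense"
        labels[pos_ill] = rng.choice(ILL_PROFILES)
    return labels


print(assign_profiles(batch=8, seed=0))
```

Calling assign_profiles twice with the same seed returns the same labels, which is the property the seeded generator in reference.py is meant to preserve.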

task.yml — make conditioning robustness ranked:

Add mixed to both tests: and benchmarks:.
Add fully ill-conditioned homogeneous batches (rankdef, clustered, nearrank) to benchmarks: at the dominant shapes, so the runtime cost of the accurate path on hard inputs is part of the geomean.
Expand the description to document mixed and the intent.
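To illustrate why adding hard cases to benchmarks: changes the ranking, here is a small sketch of a geomean score. The per-case runtimes are invented for illustration and are not measurements; only the case counts (7 existing, 5 new) come from this PR.

```python
from statistics import geometric_mean

# Hypothetical per-case runtimes in milliseconds, for illustration only.
dense_cases = [10.0] * 7        # the 7 pre-existing dense benchmark cases
hard_router = [40.0] * 5        # a routing kernel falling back to a slow exact path
hard_robust = [15.0] * 5        # a kernel with an efficient accurate path

before = geometric_mean(dense_cases)
router = geometric_mean(dense_cases + hard_router)
robust = geometric_mean(dense_cases + hard_robust)

# Both kernels tie on the dense-only score, but separate once hard cases count.
print(f"dense only: {before:.2f} ms")
print(f"with hard cases: router {router:.2f} ms, robust {robust:.2f} ms")
```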

With a heterogeneous batch present, a "sample-a-few-and-route-the-batch" kernel is either wrong (fast-paths a hidden ill-conditioned matrix → fails the checker, which runs unconditionally on the warm-up output) or conservatively slow (routes the whole batch to the exact path). Either way the shortcut no longer yields a free ranking win.
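A minimal sketch of the two outcomes, with each matrix reduced to a boolean "is ill-conditioned" flag. The function names and the probe window k=4 are assumptions made for this example; a real kernel would compute some conditioning metric per sampled matrix instead.

```python
def probe_says_well_conditioned(is_ill, k=4):
    """Batch-level probe: inspect only the first k matrices."""
    return not any(is_ill[:k])


def route(is_ill, k=4):
    """Return the path chosen for the whole batch and whether that choice is safe."""
    fast = probe_says_well_conditioned(is_ill, k)
    path = "fast" if fast else "accurate"
    # The fast path is only valid if no matrix in the batch is ill-conditioned.
    safe = not (fast and any(is_ill))
    return path, safe


homogeneous_dense = [False] * 16
# Ill-conditioned matrices hidden past the probe window.
mixed_hidden = [False] * 4 + [True, False, True] + [False] * 9
# The probe happens to see an ill-conditioned matrix.
mixed_seen = [False, True] + [False] * 14

print(route(homogeneous_dense))  # ('fast', True)
print(route(mixed_hidden))       # ('fast', False)
print(route(mixed_seen))         # ('accurate', True)
```

The second call is the "wrong" outcome (the checker would reject it) and the third is the "conservatively slow" one.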

Validation done

I could not run on a B200 (the target GPU was busy with another job), so this was validated on CPU against an unmodified copy of upstream reference.py:

Byte-identical: all 9 existing homogeneous cases + the default dense path produce torch.equal data before/after the refactor.
mixed is correct: deterministic per seed, distinct across seeds, finite, correct shape; ~60–70% well- / ~30–40% ill-conditioned by the same per-column-spread + sparsity metric a routing kernel would use, with ill-conditioned matrices at random interleaved positions (e.g. 2, 4, 5, 7, 9, …) rather than clustered at the front.
Problem stays well-posed: the reference torch.geqrf passes check_implementation on every new case with wide margin (scaled factor residual 0.002–0.015 vs gate 20; orthogonality 0.17–0.32 vs gate 100).
Spec lines parse under eval.py's regex; seeds are unique; new benchmark cases reuse existing shapes so the per-case input size (671 MB at b640/n512, 252 MB at b60/n1024) and _benchmark_batch_count (1 copy) match cases already benchmarked → memory/timeout envelope unchanged.
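The quoted input sizes can be reproduced with a short calculation. It assumes square n x n matrices, 4-byte (float32) elements, and MB meaning 10^6 bytes; the PR text does not state these, but the numbers agree with them.

```python
def batch_bytes(batch, n, bytes_per_element=4):
    """Size in bytes of a batch of n x n matrices (float32 is an assumption)."""
    return batch * n * n * bytes_per_element


print(batch_bytes(640, 512) / 1e6)   # 671.08864
print(batch_bytes(60, 1024) / 1e6)   # 251.65824
```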

Still needs a maintainer B200 run to confirm timings and that the new benchmark cases stay within benchmark_timeout/ranked_timeout (the accurate path on the ill-conditioned cases is slower than TF32 by design — that's the cost we now want ranked). Opening as a draft for that reason.

Out of scope (noted for a follow-up)

There's an orthogonal hardening opportunity in eval.py: benchmark mode times with recheck=False, and the dominant shapes feed the timed loop a single reused input object, so an output cache keyed on that input would be invisible locally. That's a different attack (output replay, not conditioning routing) and a separate change; happy to follow up if you'd like it in the same PR.
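The replay concern can be sketched with a toy timed loop. This is not eval.py's code: count_real_calls is a hypothetical stand-in, and it assumes that recheck=True means a fresh input is used on every repetition, while recheck=False reuses one input object.

```python
def count_real_calls(reps, recheck):
    """Count how often a cache-on-input-identity kernel does real work."""
    cache = {}
    real_calls = 0
    # Keep every input alive so that id() values stay unique for the whole loop.
    inputs = [[float(i), 1.0] for i in range(reps if recheck else 1)]
    for i in range(reps):
        x = inputs[i if recheck else 0]
        key = id(x)                      # cache keyed on the input object
        if key not in cache:
            real_calls += 1              # only a cache miss does the real work
            cache[key] = [2 * v for v in x]
        _ = cache[key]
    return real_calls


print(count_real_calls(100, recheck=False))  # 1
print(count_real_calls(100, recheck=True))   # 100
```

With a single reused input, 99 of the 100 timed repetitions are cache hits, so the measured time says little about the kernel itself.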

🤖 Generated with Claude Code

The QR benchmark set ranked only well-conditioned dense batches (cond 1-2), while the ill-conditioned stress structures (rankdef, clustered, band, rowscale, nearcollinear, nearrank) appeared only in the correctness tests. Worse, every batch was conditioning-homogeneous: generate_input applied one structure uniformly to all `batch` matrices.

Together these let a submission read a few matrices, conclude "the whole batch is well-conditioned," and route the entire batch to a TF32/Cholesky fast path that is only numerically valid for well-conditioned inputs -- winning the ranked (all-dense) cases while the unranked stress cases were the only thing that could have exposed the shortcut. On a realistic batch (per-layer / per-block optimizer factors with varying conditioning, in random positions) such a kernel is either wrong or silently falls back, but the benchmark never built one.

This change makes conditioning robustness part of the score, not just a gate:

- reference.py: add a `mixed` case that assigns each matrix in the batch an independent conditioning profile (well-conditioned dense majority interleaved with the ill-conditioned structures) at a random, seeded position. The per-case logic is factored into `_apply_case` and reused; existing homogeneous cases produce bit-for-bit identical data (verified on CPU), so prior leaderboard results are unaffected.
- task.yml: add `mixed` cases to the tests AND the benchmarks, plus fully ill-conditioned homogeneous batches (rankdef/clustered/nearrank) at the dominant benchmark shapes, so the runtime cost of the accurate path on hard inputs is ranked too.

The reference `torch.geqrf` passes the checker on all new cases with wide margin (scaled factor residual 0.002-0.015 vs gate 20; orthogonality 0.17-0.32 vs gate 100), so the problem stays well-posed. New benchmark cases reuse existing shapes, so the memory/timeout envelope is unchanged.

Not yet validated on B200 (authored while the target GPU was busy); needs a benchmark/leaderboard run to confirm timings and timeouts.

Co-Authored-By: Claude <noreply@anthropic.com>

brycelelbach · 2026-06-13T20:37:32Z

Yeah, I think you should turn recheck mode on for benchmark mode in a separate PR.

Please measure the wall clock time of running Benchmark and Leaderboard before and after this change.

robobryce · 2026-06-13T20:55:19Z

Addressed both points, and validated on the B200.

1. recheck for benchmark mode → separate PR. Done: #150 turns on recheck=True for benchmark mode on its own branch off main, independent of this PR.

2. Wall-clock before/after. Measured on the B200 with the torch.geqrf baseline submission, each run under the autocuda GPU lock (autocuda run exclusive) so the timing was uncontended with the optimization fleet on the node. "before" = current upstream (7 benchmark cases); "after" = this PR (12 benchmark cases: +2 mixed, +3 fully ill-conditioned).

Mode	before (7 cases)	after (12 cases)	delta
Benchmark	16.9 s	37.1 s	+20.3 s
Leaderboard	36.5 s	70.5 s	+34.1 s

The added cost is the 5 new cases, each of which exercises the accurate (slower-than-TF32) path that the new ranking is meant to reward; per-case timings for the 7 pre-existing cases are unchanged within run-to-run noise (e.g. the dominant n=512/b=640 shape: 1075.2M ns before vs 1076.5M ns after). All runs reported check: pass. Both modes stay within the configured benchmark_timeout: 480 / ranked_timeout: 900.
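As a quick sanity check on the figures above, the slowdown ratio, the remaining timeout headroom, and the drift on the dominant shape can be computed directly. This sketch assumes the timeouts are in seconds and that benchmark_timeout applies to Benchmark mode and ranked_timeout to Leaderboard mode; the inputs are the rounded values from the table.

```python
# Rounded figures from the table above (seconds); timeout mapping is assumed.
runs = {
    "Benchmark": {"before": 16.9, "after": 37.1, "timeout": 480},
    "Leaderboard": {"before": 36.5, "after": 70.5, "timeout": 900},
}

for mode, r in runs.items():
    ratio = r["after"] / r["before"]
    headroom = r["timeout"] - r["after"]
    print(f"{mode}: {ratio:.2f}x the previous wall clock, {headroom:.1f} s under the timeout")

# Dominant n=512/b=640 case, in millions of nanoseconds.
drift = (1076.5 - 1075.2) / 1075.2
print(f"dominant-shape drift: {drift:.2%}")
```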

Validation. test mode passes all 22 cases on the B200, including the 3 new mixed tests, with healthy margins (new mixed/stress cases: scaled factor residual ~0.002–0.011 vs gate 20; orthogonality ~0.18–0.40 vs gate 100) — so the problem stays well-posed under real cuSOLVER geqrf.

Marking ready for review.

robobryce mentioned this pull request Jun 13, 2026

linalg/qr: re-validate timed benchmark outputs (recheck=True) #150

Open

robobryce marked this pull request as ready for review June 13, 2026 20:55

robobryce wants to merge 1 commit into gpu-mode:main from robobryce:linalg-qr-mixed-conditioning
