linalg/qr: benchmark heterogeneous & ill-conditioned batches#149

Open
robobryce wants to merge 1 commit into gpu-mode:main from robobryce:linalg-qr-mixed-conditioning

@robobryce


Summary

The linalg/qr_py benchmark ranks submissions only on well-conditioned dense batches, while every ill-conditioned input lives in the correctness tests — and every generated batch is conditioning-homogeneous. That combination rewards a kernel that detects "this batch is well-conditioned" and routes the whole batch to a fast path (TF32 / CholeskyQR) that is only numerically valid for well-conditioned inputs. This PR makes conditioning robustness part of the score, not just a gate.

The problem

Looking at the current task.yml:

  • benchmarks: (what determines ranking) are all case: dense, cond 1–2 — uniformly well-conditioned.
  • tests: (correctness only) are where rankdef, clustered, band, rowscale, nearcollinear, nearrank live.

And in reference.py, generate_input applies each case to the entire batch by broadcasting (a * scales, a * _band_mask(...), a[:, :, rank:] = 0, …) — so all batch matrices in a call share one conditioning structure.

A submission can therefore:

  1. Sample a few matrices from the batch,
  2. Observe they're well-conditioned (true for every ranked case, and — because batches are homogeneous — representative of the whole batch),
  3. Route the whole batch to a fast path that doesn't hold the QR tolerance on ill-conditioned inputs.

Such a kernel wins the ranked cases, and the only inputs that could expose it (the stress cases) are never benchmarked — and, being homogeneous, never even mix a hidden ill-conditioned matrix into an otherwise-dense batch. This doesn't match the stated workload either: the description targets optimizer-style statistics (G @ G.T, per-layer/per-block factors), where the matrices batched into one call have widely varying conditioning at arbitrary positions, not one shared structure.

(Observed in practice: an autocuda optimization run on this problem converged on exactly this batch-level conditioning probe — sample the first few matrices, route the batch — which is the rational response to the benchmark as written. The kernel isn't cheating; the benchmark under-specifies the distribution.)
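To make the failure mode concrete, here is a minimal sketch of a batch-level probe. It is illustrative only: `column_spread`, `route_batch`, `probe_count`, and `spread_limit` are hypothetical names and thresholds, and the column-norm ratio is a crude stand-in for whatever conditioning estimate a real submission would compute.

```python
import math


def column_spread(matrix):
    """Ratio of the largest to the smallest column 2-norm (inf if a column is zero)."""
    cols = range(len(matrix[0]))
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in cols]
    smallest = min(norms)
    return math.inf if smallest == 0.0 else max(norms) / smallest


def route_batch(batch, probe_count=2, spread_limit=10.0):
    """Hypothetical router: inspect only the first few matrices, then route everything."""
    probed = batch[:probe_count]
    looks_easy = all(column_spread(m) <= spread_limit for m in probed)
    return "fast" if looks_easy else "accurate"


well = [[1.0, 0.0], [0.0, 1.0]]
rank_deficient = [[1.0, 0.0], [1.0, 0.0]]  # second column is all zeros

homogeneous = [well, well, well, well]
mixed = [well, well, well, rank_deficient]  # hard matrix hidden at index 3

print(route_batch(homogeneous))  # fast
print(route_batch(mixed))        # fast, although index 3 is rank-deficient
```

On the homogeneous batch the probe is representative. On the mixed batch it returns the same answer even though one matrix is rank-deficient, which is the gap a heterogeneous benchmark case is meant to close.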

The fix

reference.py — add a mixed case: each matrix in the batch is independently assigned a conditioning profile (a well-conditioned dense majority, ~50%, interleaved with the existing ill-conditioned structures) at a random, seeded position. The per-case logic is factored into a shared _apply_case helper; the mixed path scatters per-profile sub-batches by a torch.multinomial label vector, with a guarantee that tiny batches still contain both a well- and an ill-conditioned matrix. Determinism is preserved (seeded generator), and existing homogeneous cases produce bit-for-bit identical data (the base randn is still drawn first, then the same case extras in the same order — verified on CPU), so prior leaderboard results for the existing cases are unaffected.
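The label assignment can be sketched as follows. This is not the code in reference.py: the PR uses `torch.multinomial` with a seeded generator, while this sketch uses the standard-library `random` module to stay dependency-free. The profile names come from the PR text; `assign_profiles` and the exact weights are assumptions.

```python
import random

# Weights are an assumption: about half dense, the rest split evenly
# across the ill-conditioned structures named in task.yml.
ILL_PROFILES = ["rankdef", "clustered", "band", "rowscale", "nearcollinear", "nearrank"]
PROFILES = ["dense"] + ILL_PROFILES
WEIGHTS = [0.5] + [0.5 / len(ILL_PROFILES)] * len(ILL_PROFILES)


def assign_profiles(batch, seed):
    """Return one profile name per matrix, deterministically for a given seed."""
    rng = random.Random(seed)
    labels = rng.choices(PROFILES, weights=WEIGHTS, k=batch)
    if batch >= 2:
        # Guarantee: even a tiny batch holds at least one well-conditioned
        # and at least one ill-conditioned matrix.
        if "dense" not in labels:
            labels[rng.randrange(batch)] = "dense"
        if all(label == "dense" for label in labels):
            labels[rng.randrange(batch)] = rng.choice(ILL_PROFILES)
    return labels


labels = assign_profiles(batch=8, seed=0)
ill_positions = [i for i, label in enumerate(labels) if label != "dense"]
print(labels)
print(ill_positions)
```

Each label then selects the sub-batch that a per-profile helper such as `_apply_case` would transform, so ill-conditioned matrices land at seeded, arbitrary positions instead of sharing one structure across the batch.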

task.yml — make conditioning robustness ranked:

  • Add mixed to both tests: and benchmarks:.
  • Add fully ill-conditioned homogeneous batches (rankdef, clustered, nearrank) to benchmarks: at the dominant shapes, so the runtime cost of the accurate path on hard inputs is part of the geomean.
  • Expand the description to document mixed and the intent.

With a heterogeneous batch present, a "sample-a-few-and-route-the-batch" kernel is either wrong (fast-paths a hidden ill-conditioned matrix → fails the checker, which runs unconditionally on the warm-up output) or conservatively slow (routes the whole batch to the exact path). Either way the shortcut no longer yields a free ranking win.
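Why the fast path is "wrong" on a hidden ill-conditioned matrix can be seen from the standard textbook bounds for CholeskyQR (form the Gram matrix, take its Cholesky factor, then solve for Q). These are general numerical-analysis results, not something taken from the PR's code; the constants hidden in the O-notation depend on the matrix dimensions, and the TF32 figure assumes the Gram matrix is formed with TF32's 10-bit mantissa.

```latex
% CholeskyQR: G = A^T A,  R = chol(G),  Q = A R^{-1}
\kappa_2(A^{\mathsf T} A) = \kappa_2(A)^2
\qquad\Longrightarrow\qquad
\lVert \hat Q^{\mathsf T} \hat Q - I \rVert_2 = O\!\left(\kappa_2(A)^2\, u\right)
\quad \text{(CholeskyQR)},
\qquad
\lVert \hat Q^{\mathsf T} \hat Q - I \rVert_2 = O(u)
\quad \text{(Householder QR)}

% The fast path is usable only while \kappa_2(A)^2 u \ll 1, i.e.
\kappa_2(A) \lesssim u^{-1/2} \approx
\begin{cases}
2^{12} \approx 4.1 \times 10^{3}, & u = 2^{-24} \ \text{(FP32)} \\
2^{5.5} \approx 45, & u = 2^{-11} \ \text{(TF32 mantissa)}
\end{cases}
```

The ranked dense cases (cond 1–2) sit comfortably inside either bound. A rank-deficient matrix has an unbounded condition number, so its Gram matrix is singular and the Cholesky step has no valid factor to return.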

Validation done

I could not run on a B200 (the target GPU was busy with another job), so this was validated on CPU against an unmodified copy of upstream reference.py:

  • Byte-identical: all 9 existing homogeneous cases + the default dense path produce data that is torch.equal before and after the refactor.
  • mixed is correct: deterministic per seed, distinct across seeds, finite, correct shape; ~60–70% well- / ~30–40% ill-conditioned by the same per-column-spread + sparsity metric a routing kernel would use, with ill-conditioned matrices at random interleaved positions (e.g. 2, 4, 5, 7, 9, …) rather than clustered at the front.
  • Problem stays well-posed: the reference torch.geqrf passes check_implementation on every new case with wide margin (scaled factor residual 0.002–0.015 vs gate 20; orthogonality 0.17–0.32 vs gate 100).
  • Spec lines parse under eval.py's regex; seeds are unique; new benchmark cases reuse existing shapes so the per-case input size (671 MB at b640/n512, 252 MB at b60/n1024) and _benchmark_batch_count (1 copy) match cases already benchmarked → memory/timeout envelope unchanged.
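The per-case input sizes quoted above can be reproduced with a line of arithmetic. The sketch assumes square n × n matrices stored as 4-byte float32 elements, which the PR does not state outright but which matches both quoted figures; `batch_bytes` is a hypothetical helper name.

```python
def batch_bytes(batch, n, bytes_per_element=4):
    """Size in bytes of a batch of n x n matrices (assumed square, float32)."""
    return batch * n * n * bytes_per_element


for batch, n in [(640, 512), (60, 1024)]:
    print(f"b{batch}/n{n}: {batch_bytes(batch, n) / 1e6:.0f} MB")
# b640/n512: 671 MB
# b60/n1024: 252 MB
```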

Still needs a maintainer B200 run to confirm timings and that the new benchmark cases stay within benchmark_timeout/ranked_timeout (the accurate path on the ill-conditioned cases is slower than TF32 by design — that's the cost we now want ranked). Opening as a draft for that reason.

Out of scope (noted for a follow-up)

There's an orthogonal hardening opportunity in eval.py: benchmark mode times with recheck=False, and the dominant shapes feed the timed loop a single reused input object, so an output cache keyed on that input would be invisible locally. That's a different attack (output replay, not conditioning routing) and a separate change; happy to follow up if you'd like it in the same PR.
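A toy model shows why a single reused input object makes such a cache invisible to timing. Everything here is hypothetical (`make_replaying_kernel`, `count_real_calls`, and the list-doubling stand-in for the factorization); it does not model eval.py or what `recheck` does, only the effect of feeding one object to every iteration.

```python
def make_replaying_kernel(real_kernel):
    """Hypothetical wrapper that replays the output for an input object it has seen."""
    cache = {}

    def kernel(x):
        key = id(x)  # identity of the input object, not its contents
        if key not in cache:
            cache[key] = real_kernel(x)
        return cache[key]

    return kernel


def count_real_calls(inputs):
    """Run a stand-in timed loop and count how often real work happened."""
    calls = []

    def real_kernel(x):
        calls.append(1)
        return [2 * v for v in x]  # stand-in for the actual factorization

    kernel = make_replaying_kernel(real_kernel)
    for x in inputs:
        kernel(x)
    return len(calls)


shared = [1.0, 2.0, 3.0]
reused_inputs = [shared] * 10                     # one object fed to every iteration
fresh_inputs = [list(shared) for _ in range(10)]  # distinct objects, same values

print(count_real_calls(reused_inputs))  # 1
print(count_real_calls(fresh_inputs))   # 10
```

With the reused object, nine of the ten timed iterations do no work, and with no output check in the timed loop nothing signals that.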

🤖 Generated with Claude Code

The QR benchmark set ranked only well-conditioned dense batches (cond 1-2),
while the ill-conditioned stress structures (rankdef, clustered, band,
rowscale, nearcollinear, nearrank) appeared only in the correctness tests.
Worse, every batch was conditioning-homogeneous: generate_input applied one
structure uniformly to all `batch` matrices.

Together these let a submission read a few matrices, conclude "the whole batch
is well-conditioned," and route the entire batch to a TF32/Cholesky fast path
that is only numerically valid for well-conditioned inputs -- winning the
ranked (all-dense) cases while the unranked stress cases were the only thing
that could have exposed the shortcut. On a realistic batch (per-layer / per-
block optimizer factors with varying conditioning, in random positions) such a
kernel is either wrong or silently falls back, but the benchmark never built
one.

This change makes conditioning robustness part of the score, not just a gate:

- reference.py: add a `mixed` case that assigns each matrix in the batch an
  independent conditioning profile (well-conditioned dense majority interleaved
  with the ill-conditioned structures) at a random, seeded position. The
  per-case logic is factored into `_apply_case` and reused; existing
  homogeneous cases produce bit-for-bit identical data (verified on CPU), so
  prior leaderboard results are unaffected.
- task.yml: add `mixed` cases to the tests AND the benchmarks, plus fully
  ill-conditioned homogeneous batches (rankdef/clustered/nearrank) at the
  dominant benchmark shapes, so the runtime cost of the accurate path on hard
  inputs is ranked too.

The reference `torch.geqrf` passes the checker on all new cases with wide
margin (scaled factor residual 0.002-0.015 vs gate 20; orthogonality
0.17-0.32 vs gate 100), so the problem stays well-posed. New benchmark cases
reuse existing shapes, so the memory/timeout envelope is unchanged.

Not yet validated on B200 (authored while the target GPU was busy); needs a
benchmark/leaderboard run to confirm timings and timeouts.

Co-Authored-By: Claude <noreply@anthropic.com>
@brycelelbach


Yeah, I think you should turn recheck mode on for benchmark mode in a separate PR.

Please measure the wall clock time of running Benchmark and Leaderboard before and after this change.

@robobryce marked this pull request as ready for review on June 13, 2026, 20:55
@robobryce

Author

Addressed both points, and validated on the B200.

1. recheck for benchmark mode → separate PR. Done: #150 turns on recheck=True for benchmark mode on its own branch off main, independent of this PR.

2. Wall-clock before/after. Measured on the B200 with the torch.geqrf baseline submission, each run under the autocuda GPU lock (autocuda run exclusive) so the timing was uncontended with the optimization fleet on the node. "before" = current upstream (7 benchmark cases); "after" = this PR (12 benchmark cases: +2 mixed, +3 fully ill-conditioned).

| Mode | before (7 cases) | after (12 cases) | delta |
|---|---|---|---|
| Benchmark | 16.9 s | 37.1 s | +20.3 s |
| Leaderboard | 36.5 s | 70.5 s | +34.1 s |

The added cost is the 5 new cases, each of which exercises the accurate (slower-than-TF32) path that the new ranking is meant to reward; per-case timings for the 7 pre-existing cases are unchanged within run-to-run noise (e.g. the dominant n=512/b=640 shape: 1075.2M ns before vs 1076.5M ns after). All runs reported check: pass. Both modes stay within the configured benchmark_timeout: 480 / ranked_timeout: 900.
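For reviewers who want the relative numbers, the figures above reduce to a few ratios. The inputs are the rounded values reported in this comment; treating the timeouts as seconds, and pairing benchmark_timeout with Benchmark mode and ranked_timeout with Leaderboard mode, are assumptions.

```python
before = {"Benchmark": 16.9, "Leaderboard": 36.5}  # seconds, as reported
after = {"Benchmark": 37.1, "Leaderboard": 70.5}   # seconds, as reported
timeouts = {"Benchmark": 480, "Leaderboard": 900}  # assumed to be seconds

for mode in before:
    ratio = after[mode] / before[mode]
    used = after[mode] / timeouts[mode]
    print(f"{mode}: {ratio:.2f}x the previous wall clock, {used:.1%} of timeout")

case_growth = 12 / 7                       # benchmark cases after / before
drift = (1076.5 - 1075.2) / 1075.2         # dominant n=512/b=640 shape
print(f"case count: {case_growth:.2f}x, dominant-shape drift: {drift:.2%}")
```

Wall clock roughly doubles in both modes while staying under a tenth of either timeout, and the drift on the dominant pre-existing shape is on the order of a tenth of a percent.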

Validation. test mode passes all 22 cases on the B200, including the 3 new mixed tests, with healthy margins (new mixed/stress cases: scaled factor residual ~0.002–0.011 vs gate 20; orthogonality ~0.18–0.40 vs gate 100) — so the problem stays well-posed under real cuSOLVER geqrf.
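The two quantities compared against the gates can be pictured with a small sketch. The exact formulas in check_implementation are not shown in this PR, so the scaling by n times float32 machine epsilon, the Frobenius norm, and all function names below are assumptions; only the gate values 20 and 100 come from the text.

```python
import math

EPS32 = 2.0 ** -23  # float32 machine epsilon (assumed scaling unit)


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def frobenius(a):
    return math.sqrt(sum(x * x for row in a for x in row))


def scaled_factor_residual(a, q, r):
    """||A - QR||_F / (||A||_F * n * eps): how well the factors reproduce A."""
    qr = matmul(q, r)
    diff = [[a[i][j] - qr[i][j] for j in range(len(a[0]))] for i in range(len(a))]
    return frobenius(diff) / (frobenius(a) * len(a) * EPS32)


def scaled_orthogonality(q):
    """||Q^T Q - I||_F / (n * eps): how far Q is from orthonormal columns."""
    n = len(q[0])
    qtq = matmul([list(col) for col in zip(*q)], q)
    diff = [[qtq[i][j] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return frobenius(diff) / (n * EPS32)


q = [[0.6, -0.8], [0.8, 0.6]]           # a rotation, so orthogonal
r = [[2.0, 1.0], [0.0, 3.0]]            # upper triangular
a = matmul(q, r)
q_bad = [[0.601, -0.8], [0.8, 0.6]]     # slightly non-orthogonal Q

print(scaled_factor_residual(a, q, r) < 20)   # True
print(scaled_orthogonality(q) < 100)          # True
print(scaled_orthogonality(q_bad) < 100)      # False
```

Under this reading, a value near 1 means an error of about one rounding unit per dimension, so the reported 0.002–0.011 and 0.18–0.40 leave wide headroom below the gates.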

Marking ready for review.
