linalg/qr: re-validate timed benchmark outputs (recheck=True) by robobryce · Pull Request #150 · gpu-mode/reference-kernels

robobryce · 2026-06-13T20:54:48Z

Summary

run_benchmarking times each shape with recheck=False, so only the pre-timing warmup output is validated — the timed iterations themselves are never re-checked. Combined with the fact that the low-count benchmark shapes feed the timed loop a single reused input object across all repeats, a kernel that diverges only inside the timed region (e.g. one that caches and replays an output keyed on that stable input) is scored as fast and never caught locally. Only the remote leaderboard mode — which already passes recheck=True — would have any chance of noticing.

This sets recheck=True on the per-shape benchmark call so benchmark mode matches leaderboard mode: any timed-loop divergence fails locally too. The warmup correctness check and the timing methodology are otherwise unchanged.

-        result = run_single_benchmark(pool, test, False, 200, 10e9)
+        result = run_single_benchmark(pool, test, True, 200, 10e9)
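The effect of the flag can be seen in a toy harness. This is a minimal sketch, not the repository's `run_single_benchmark`: the names `toy_benchmark`, `reference`, `check` and `DivergingKernel` are hypothetical, and the "kernel" is a plain Python callable rather than a GPU QR. It only shows why a warmup-only check misses a kernel that misbehaves once timing starts.

```python
import time


def reference(x):
    # Stand-in for the real computation: sum of squares.
    return sum(v * v for v in x)


def check(x, out):
    # Stand-in for the correctness checker.
    return out == reference(x)


class DivergingKernel:
    """Hypothetical kernel: honest on the first call, wrong (and cheap) afterwards."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        if self.calls == 1:
            return reference(x)  # the warmup call does the real work
        return 0  # later calls skip the work entirely


def toy_benchmark(kernel, x, recheck, repeats):
    # The warmup output is validated regardless of `recheck`.
    if not check(x, kernel(x)):
        return "fail (warmup)"
    for i in range(repeats):
        start = time.perf_counter()
        out = kernel(x)
        _elapsed = time.perf_counter() - start  # in this toy, only the kernel call is timed
        # With recheck=True every timed output is validated as well.
        if recheck and not check(x, out):
            return f"fail (timed iteration {i})"
    return "pass"


x = list(range(1000))  # one input object reused across all repeats
print(toy_benchmark(DivergingKernel(), x, recheck=False, repeats=5))  # pass
print(toy_benchmark(DivergingKernel(), x, recheck=True, repeats=5))   # fail (timed iteration 0)
```

With `recheck=False` the divergent kernel is reported as passing; with `recheck=True` the first timed iteration already fails.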

Cost (measured on B200)

Wall-clock for benchmark mode with the torch.geqrf baseline over the 7 current benchmark shapes, run under the autocuda GPU lock so the measurement was uncontended:

| Setting | Benchmark wall-clock |
| --- | --- |
| `recheck=False` (before) | 16.9 s |
| `recheck=True` (after) | 22.7 s |
| delta | +5.8 s (~34%) |

The extra cost is one FP64 check_implementation pass (materialize Q = householder_product(H, tau), two batched matmuls) per timed iteration. Correctness gate unchanged (check: pass). This stays well within benchmark_timeout: 480.
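For a sense of what one such pass involves, here is a rough sketch of an FP64 QR check built on `torch.linalg.householder_product`. It is not the repository's `check_implementation`: the function name `check_qr`, the two invariants chosen (reconstruction and orthogonality), the tolerance, and the assumption of tall or square inputs (rows >= columns) are all illustrative guesses.

```python
import torch


def check_qr(A, H, tau, rtol=1e-10):
    """Hypothetical checker: validate a geqrf-style (H, tau) against A in FP64."""
    A64, H64, tau64 = A.double(), H.double(), tau.double()
    n = A64.shape[-1]

    # Materialize the reduced Q from the Householder reflectors.
    Q = torch.linalg.householder_product(H64, tau64)
    # R is the upper triangle of H (top n rows).
    R = torch.triu(H64)[..., :n, :]

    # Batched matmul 1: reconstruction, Q R should reproduce A.
    recon_err = torch.linalg.matrix_norm(Q @ R - A64)
    # Batched matmul 2: orthogonality, Q^T Q should be the identity.
    eye = torch.eye(n, dtype=torch.float64, device=A64.device)
    orth_err = torch.linalg.matrix_norm(Q.transpose(-2, -1) @ Q - eye)

    scale = torch.linalg.matrix_norm(A64)
    recon_ok = (recon_err <= rtol * scale).all()
    orth_ok = (orth_err <= rtol * n).all()
    return bool(recon_ok and orth_ok)


torch.manual_seed(0)
A = torch.randn(4, 8, 8, dtype=torch.float64)  # small batch, CPU is fine
H, tau = torch.geqrf(A)
print(check_qr(A, H, tau))
```

Running something of this shape after every timed iteration, rather than once after warmup, is the kind of work that would account for the added wall-clock.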

Relationship to #149

Split out from #149 (heterogeneous/ill-conditioned benchmark inputs) at the maintainer's request — that PR closes the conditioning-routing loophole; this one closes the output-replay loophole. They are independent and can merge in either order.

Notes / open question

This does not, by itself, stop a cache that replays a previously-correct (H, tau) for the same unchanged input: the checker validates input-derived QR invariants against that same A, so a same-input replay still passes even under recheck. Fully closing that would additionally require re-cloning/regenerating the input on each timed repeat for the low-count shapes (a more invasive change to the timing methodology). Happy to follow up with that if you want it; I kept this PR to the minimal, clearly-correct one-line change.
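The remaining gap can be illustrated with another toy, again under stated assumptions: `ReplayKernel`, `timed_loop`, `reference` and `check` are hypothetical names, the cache is keyed on object identity, and a list copy stands in for re-cloning a tensor. The sketch shows that a replayed correct output passes every recheck, and that handing the kernel a fresh clone on each repeat is what forces real work.

```python
def reference(x):
    # Stand-in for the real computation: sum of squares.
    return sum(v * v for v in x)


def check(x, out):
    # Stand-in for a checker that validates the output against the same input.
    return out == reference(x)


class ReplayKernel:
    """Hypothetical kernel: computes once per input object, then replays the result."""

    def __init__(self):
        self.last_input = None
        self.last_output = None
        self.real_runs = 0

    def __call__(self, x):
        if x is self.last_input:
            return self.last_output  # replay: no work done
        self.real_runs += 1
        self.last_input = x
        self.last_output = reference(x)
        return self.last_output


def timed_loop(kernel, x, repeats, fresh_input):
    # Every output is rechecked, as with recheck=True.
    for _ in range(repeats):
        arg = list(x) if fresh_input else x  # clone per repeat, or reuse one object
        if not check(arg, kernel(arg)):
            return "fail"
    return "pass"


x = list(range(1000))

reused = ReplayKernel()
print(timed_loop(reused, x, repeats=5, fresh_input=False), reused.real_runs)  # pass 1

cloned = ReplayKernel()
print(timed_loop(cloned, x, repeats=5, fresh_input=True), cloned.real_runs)   # pass 5
```

Both runs pass the check, but with a reused input the kernel only computes once. In this toy, cloning defeats a cache keyed on object identity; a cache keyed on the input's contents would still hit, which is where regenerating the input would come in.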

🤖 Generated with Claude Code

`run_benchmarking` timed each shape with `recheck=False`, so only the pre-timing warmup output was validated -- the timed iterations were never re-checked. For the low-`count` benchmark shapes the timed loop reuses a single input object across all repeats, so a kernel that diverges only inside the timed region (e.g. one that caches and replays an output keyed on the reused input) is scored as fast and never caught locally; only the remote `leaderboard` mode, which already passes `recheck=True`, would have a chance to notice.

Set `recheck=True` on the per-shape benchmark call so `benchmark` mode matches `leaderboard` mode and any timed-loop divergence fails locally. The warmup correctness check and the timing methodology are otherwise unchanged.

Cost (B200, `torch.geqrf` baseline, the 7 current benchmark shapes): benchmark wall-clock 16.9s -> 22.7s (+5.8s, ~34%), from the extra FP64 checker pass per timed iteration. Correctness gate unchanged (check: pass).

Co-Authored-By: Claude <noreply@anthropic.com>

robobryce mentioned this pull request Jun 13, 2026

linalg/qr: benchmark heterogeneous & ill-conditioned batches #149

Open

robobryce wants to merge 1 commit into gpu-mode:main from robobryce:linalg-qr-benchmark-recheck
