linalg/qr: re-validate timed benchmark outputs (recheck=True)#150

Open
robobryce wants to merge 1 commit into gpu-mode:main from robobryce:linalg-qr-benchmark-recheck

@robobryce

Summary

run_benchmarking times each shape with recheck=False, so only the pre-timing warmup output is validated; the timed iterations themselves are never re-checked. In addition, the low-count benchmark shapes feed the timed loop a single reused input object across all repeats. A kernel that diverges only inside the timed region (e.g. one that caches and replays an output keyed on that stable input) is therefore scored as fast and never caught locally. Only the remote leaderboard mode, which already passes recheck=True, would have any chance of noticing.

This sets recheck=True on the per-shape benchmark call so benchmark mode matches leaderboard mode: any timed-loop divergence fails locally too. The warmup correctness check and the timing methodology are otherwise unchanged.

-        result = run_single_benchmark(pool, test, False, 200, 10e9)
+        result = run_single_benchmark(pool, test, True, 200, 10e9)
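To illustrate what the flag changes, here is a minimal toy sketch, not the harness code. `time_kernel`, `reference_check`, and `LazyKernel` are hypothetical names, and sorting a list stands in for the QR kernel. The kernel is honest on the warmup call and skips the work afterwards, so it only fails when the timed outputs are re-validated.

```python
import time


def reference_check(data, output):
    """Toy checker: the output must be the sorted input."""
    return output == sorted(data)


def time_kernel(kernel, data, recheck, repeats=5):
    """Validate a warmup call, then time `repeats` calls.

    Returns (passed, durations_ns). With recheck=True every timed
    output is validated as well, outside the timed window.
    """
    if not reference_check(data, kernel(data)):  # warmup gate
        return False, []
    durations = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        output = kernel(data)
        durations.append(time.perf_counter_ns() - start)
        if recheck and not reference_check(data, output):
            return False, durations
    return True, durations


class LazyKernel:
    """Hypothetical cheat: honest on the first call, skips the work afterwards."""

    def __init__(self):
        self.calls = 0

    def __call__(self, data):
        self.calls += 1
        if self.calls == 1:
            return sorted(data)
        return list(data)  # unsorted copy: wrong, but fast


data = [3, 1, 2]
passed_without, _ = time_kernel(LazyKernel(), data, recheck=False)
passed_with, _ = time_kernel(LazyKernel(), data, recheck=True)
honest_passed, _ = time_kernel(sorted, data, recheck=True)
```

In this toy, `passed_without` is `True` because only the warmup is checked, `passed_with` is `False` because the first timed output is rejected, and an honest kernel passes either way.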

Cost (measured on B200)

Wall-clock for benchmark mode with the torch.geqrf baseline over the 7 current benchmark shapes, run under the autocuda GPU lock so the measurement was uncontended:

| | benchmark wall-clock |
|---|---|
| recheck=False (before) | 16.9 s |
| recheck=True (after) | 22.7 s |
| delta | +5.8 s (~34%) |

The extra cost is one FP64 check_implementation pass (materialize Q = householder_product(H, tau), two batched matmuls) per timed iteration. Correctness gate unchanged (check: pass). This stays well within benchmark_timeout: 480.
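For reference, a sketch of the invariants such a pass would typically evaluate. This is an assumption about what the two matmuls compute, not something read from the checker: $R$ is taken to be the upper triangle of $H$, and the tolerances $\varepsilon_1, \varepsilon_2$ are placeholders.

$$
Q = \mathrm{householder\_product}(H, \tau), \qquad R = \mathrm{triu}(H)
$$

$$
\lVert QR - A \rVert \le \varepsilon_1 \lVert A \rVert, \qquad \lVert Q^{\top} Q - I \rVert \le \varepsilon_2
$$

Under that reading, one batched matmul forms $QR$ and the other forms $Q^{\top} Q$.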

Relationship to #149

Split out from #149 (heterogeneous/ill-conditioned benchmark inputs) at the maintainer's request — that PR closes the conditioning-routing loophole; this one closes the output-replay loophole. They are independent and can merge in either order.

Notes / open question

This does not, by itself, stop a cache that replays a previously-correct (H, tau) for the same unchanged input: the checker validates input-derived QR invariants against that same A, so a same-input replay still passes even under recheck. Fully closing that would additionally require re-cloning/regenerating the input on each timed repeat for the low-count shapes (a more invasive change to the timing methodology). Happy to follow up with that if you want it; I kept this PR to the minimal, clearly-correct one-line change.
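A toy sketch of that remaining gap and of the possible follow-up, under the same caveats: `ReplayKernel`, `timed_loop_passes`, and `make_input` are hypothetical, sorting stands in for QR, and timing is omitted. The replaying kernel passes every recheck while the input is reused, and fails as soon as each repeat gets a freshly generated input.

```python
import random


def check(data, output):
    """Toy checker: the output must be the sorted input."""
    return output == sorted(data)


class ReplayKernel:
    """Hypothetical cheat: computes once, then replays the cached output."""

    def __init__(self):
        self.cached = None

    def __call__(self, data):
        if self.cached is None:
            self.cached = sorted(data)
        return self.cached


def make_input(seed):
    """Deterministic toy input; a different seed gives different values."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(8)]


def timed_loop_passes(kernel, fresh_input_per_repeat, repeats=5):
    """Warmup plus `repeats` rechecked calls (timing omitted for brevity)."""
    data = make_input(0)
    if not check(data, kernel(data)):
        return False
    for i in range(1, repeats + 1):
        if fresh_input_per_repeat:
            data = make_input(i)  # regenerate the input for this repeat
        if not check(data, kernel(data)):
            return False
    return True


reused_passes = timed_loop_passes(ReplayKernel(), fresh_input_per_repeat=False)
fresh_passes = timed_loop_passes(ReplayKernel(), fresh_input_per_repeat=True)
honest_passes = timed_loop_passes(sorted, fresh_input_per_repeat=True)
```

Here `reused_passes` is `True` and `fresh_passes` is `False`, while an honest kernel passes in both configurations. In a real harness the regeneration would have to happen outside the timed window, which is part of why it is a more invasive change.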

🤖 Generated with Claude Code

`run_benchmarking` timed each shape with recheck=False, so only the
pre-timing warmup output was validated -- the timed iterations were never
re-checked. For the low-`count` benchmark shapes the timed loop reuses a
single input object across all repeats, so a kernel that diverges only inside
the timed region (e.g. one that caches and replays an output keyed on the
reused input) is scored as fast and never caught locally; only the remote
`leaderboard` mode, which already passes recheck=True, would have a chance to
notice.

Set recheck=True on the per-shape benchmark call so `benchmark` mode matches
`leaderboard` mode and any timed-loop divergence fails locally. The warmup
correctness check and the timing methodology are otherwise unchanged.

Cost (B200, torch.geqrf baseline, the 7 current benchmark shapes): benchmark
wall-clock 16.9s -> 22.7s (+5.8s, ~34%), from the extra FP64 checker pass per
timed iteration. Correctness gate unchanged (check: pass).

Co-Authored-By: Claude <noreply@anthropic.com>