linalg/qr: re-validate timed benchmark outputs (recheck=True)#150
Open
robobryce wants to merge 1 commit into
Open
linalg/qr: re-validate timed benchmark outputs (recheck=True)#150robobryce wants to merge 1 commit into
robobryce wants to merge 1 commit into
Conversation
`run_benchmarking` timed each shape with recheck=False, so only the pre-timing warmup output was validated -- the timed iterations were never re-checked. For the low-`count` benchmark shapes the timed loop reuses a single input object across all repeats, so a kernel that diverges only inside the timed region (e.g. one that caches and replays an output keyed on the reused input) is scored as fast and never caught locally; only the remote `leaderboard` mode, which already passes recheck=True, would have a chance to notice. Set recheck=True on the per-shape benchmark call so `benchmark` mode matches `leaderboard` mode and any timed-loop divergence fails locally. The warmup correctness check and the timing methodology are otherwise unchanged. Cost (B200, torch.geqrf baseline, the 7 current benchmark shapes): benchmark wall-clock 16.9s -> 22.7s (+5.8s, ~34%), from the extra FP64 checker pass per timed iteration. Correctness gate unchanged (check: pass). Co-Authored-By: Claude <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
run_benchmarkingtimes each shape withrecheck=False, so only the pre-timing warmup output is validated — the timed iterations themselves are never re-checked. Combined with the fact that the low-countbenchmark shapes feed the timed loop a single reused input object across all repeats, a kernel that diverges only inside the timed region (e.g. one that caches and replays an output keyed on that stable input) is scored as fast and never caught locally. Only the remoteleaderboardmode — which already passesrecheck=True— would have any chance of noticing.This sets
recheck=Trueon the per-shapebenchmarkcall sobenchmarkmode matchesleaderboardmode: any timed-loop divergence fails locally too. The warmup correctness check and the timing methodology are otherwise unchanged.Cost (measured on B200)
Wall-clock for
benchmarkmode with thetorch.geqrfbaseline over the 7 current benchmark shapes, run under the autocuda GPU lock so the measurement was uncontended:recheck=False(before)recheck=True(after)The extra cost is one FP64
check_implementationpass (materializeQ = householder_product(H, tau), two batched matmuls) per timed iteration. Correctness gate unchanged (check: pass). This stays well withinbenchmark_timeout: 480.Relationship to #149
Split out from #149 (heterogeneous/ill-conditioned benchmark inputs) at the maintainer's request — that PR closes the conditioning-routing loophole; this one closes the output-replay loophole. They are independent and can merge in either order.
Notes / open question
This does not, by itself, stop a cache that replays a previously-correct
(H, tau)for the same unchanged input: the checker validates input-derived QR invariants against that sameA, so a same-input replay still passes even under recheck. Fully closing that would additionally require re-cloning/regenerating the input on each timed repeat for the low-countshapes (a more invasive change to the timing methodology). Happy to follow up with that if you want it; I kept this PR to the minimal, clearly-correct one-line change.🤖 Generated with Claude Code