Benchmark + analysis: rust-gpu vs hand-WGSL - parity on integer kernels, ~1.8x on a path tracer (CFG shape), bounds checks worth 2.5x #614
Replies: 3 comments 7 replies
-
First of all, thank you for making this, this is super interesting! I know @nazar-pc asked about some sort of benchmark comparing rust-gpu to other systems in the past. Here are my results on a Strix Halo 8060S on Linux with (standard) RADV drivers:
Notice how your render (raymarcher of CSG) is just as fast with rustgpu-spv and hand-wgsl? I'm also noticing that your runs are so cheap that they barely utilize my GPU, and your GPU is likely sitting mostly idle as well. Generally, GPUs are throughput-optimized, so benchmarking their latency often doesn't make much sense.
Also note that you're measuring the runtime on the CPU, whereas you could also measure it on the GPU with Vulkan extensions (which aren't exposed in wgpu, so actually measuring that is hard). You're submitting work and then the CPU waits for the GPU work to complete, measuring that before starting the next run, during which your GPU sits idle waiting for work, which also isn't great for measuring throughput. Yeah, benchmarking GPUs can be a lot harder than CPUs...
Minor notes regarding rust-gpu setup:
cargo install uses the latest release, which is still the stub. But you can
fixed with #613
It's not just Windows, and it's slightly more complicated than that. cargo-gpu currently doesn't have a lockfile and is racy. First it installs the Rust toolchain via rustup, which is also racy and may leave your toolchain installed but broken, requiring you to remove and reinstall it. I've seen it mess up installs when you terminate rustup during installation. Builds of the spirv codegen backend within cargo-gpu also race with each other; you often see that in it "failing to remove
You can also build the shaders within your
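The throughput point raised earlier in this comment could be approached by timing a batch of dispatches per sample instead of one submit-and-wait at a time, so the fixed submit/wait cost is spread over many dispatches. Below is a minimal CPU-side sketch in plain Rust. It assumes a hypothetical `submit_batch_and_wait` closure that queues `batch` dispatches and blocks until the GPU has finished them; that closure, `time_batched`, and `median_ms` are illustrative names and not part of wgpu or the benchmark repo.

```rust
use std::time::Instant;

/// Median of the samples (mean of the two middle values for even counts).
/// Returns `None` for an empty slice.
pub fn median_ms(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Collects `runs` timing samples. Each sample times one call that is assumed
/// to submit `batch` dispatches and wait for all of them (hypothetical closure),
/// then divides by `batch` to get an amortized per-dispatch time in ms.
pub fn time_batched<F: FnMut(usize)>(
    runs: usize,
    batch: usize,
    mut submit_batch_and_wait: F,
) -> Vec<f64> {
    assert!(batch > 0, "batch must be at least 1");
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            submit_batch_and_wait(batch);
            start.elapsed().as_secs_f64() * 1e3 / batch as f64
        })
        .collect()
}
```

With `batch = 1` this degenerates to the latency-style measurement described above; larger batches keep the GPU fed between waits. This is still wall-clock time on the CPU, not GPU-side timing.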
-
How do I get rid of it? I use
-
@botBehavior based on what you are doing, you might be interested in this (older) example and blog post: https://rust-gpu.github.io/blog/2025/07/25/rust-on-every-gpu/.
-
Hi! We built what we believe is the first published benchmark of rust-gpu-emitted SPIR-V against hand-written WGSL twins, plus a root-cause analysis of the gaps. Sharing because two findings seem directly useful to the project, and one is a genuinely good story for rust-gpu.
mem2reg can quadratically amplify already-exponential inlining (turning seconds into minutes). #63 (inliner→mem2reg pathology, which is consistent with the CFG-shape finding below).
Setup
0.10.0-alpha.1 (nightly-2026-04-11), wgpu/naga 29.0.3, RTX 5070 Ti (driver 32.0.15.9649), Windows 11, Vulkan backend.
Results (median ms)
get_unchecked
Findings
Branchy integer code: parity (rust-gpu slightly ahead on Collatz). The "equivalent SPIR-V ⇒ equivalent speed" expectation holds here.
Slice bounds checks are a 2.5× lever on hot loops:
get_unchecked takes the matmul from 1.798 → 0.696 ms, which makes it 2.1× faster than the hand-WGSL twin, which cannot opt out of wgpu's clamp checks. (Through the naga frontend the win disappears, because wgpu re-injects its runtime checks, so this is a native-passthrough advantage. It's also a nice rust-gpu pitch: Rust has an audited escape hatch, WGSL doesn't.)
The path-tracer gap (1.84×) is codegen shape, not bloat or math. Evidence:
render_cs 466 ops vs 426 total for the naga-compiled hand twin. Transcendentals are native OpExtInst (9 of them); the "software libm" theory is wrong.
&mut changed nothing (rust-gpu inlines by policy regardless), and the experimental qptr pipeline (--spirt-passes=qptr) builds + runs all our workloads correctly but is perf-neutral here, as expected.
So the actionable question for SPIR-T: is there room for Phi-reduction / block-layout / selective-outlining work targeting driver-compiler friendliness? The tracer is small and self-contained if you want it as a test case.
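To make the bounds-check finding above concrete, here is a minimal sketch of the two indexing styles on a row-major matmul. It is ordinary CPU Rust, not the benchmark's rust-gpu kernel, and the function names `matmul_checked` and `matmul_unchecked` are hypothetical; it only illustrates the shape of the change (validate lengths once, then use `get_unchecked` in the hot loop).

```rust
/// Row-major n×n matmul with ordinary indexing. Every slice access carries
/// a bounds check unless the optimizer can prove it redundant.
pub fn matmul_checked(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
    for i in 0..n {
        for j in 0..n {
            let mut acc = 0.0f32;
            for k in 0..n {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/// Same computation, but lengths are validated once up front and the hot
/// loop uses `get_unchecked`, which skips the per-access bounds check.
pub fn matmul_unchecked(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
    let len = n.checked_mul(n).expect("n * n overflows usize");
    assert!(a.len() >= len && b.len() >= len && c.len() >= len);
    for i in 0..n {
        for j in 0..n {
            let mut acc = 0.0f32;
            for k in 0..n {
                // SAFETY: i, j, k < n, so both indices are < n * n <= len,
                // and all three slices were checked to hold at least `len` items.
                acc += unsafe { *a.get_unchecked(i * n + k) * *b.get_unchecked(k * n + j) };
            }
            // SAFETY: i * n + j < n * n <= c.len(), as checked above.
            unsafe { *c.get_unchecked_mut(i * n + j) = acc };
        }
    }
}
```

The single `assert!` outside the loops is what keeps the `unsafe` blocks auditable: the safety argument is stated once and the inner loop carries no per-access check.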
Caveats
One GPU / one driver / one OS; alpha-toolchain snapshot; hand-WGSL idiomatic but not heroically tuned (the comparison targets codegen, not effort). Methodology details, raw JSON, and the opcode-stats tool are all in the repo.
Disclosure: this work was produced by an AI assistant (Claude) operating under human direction and review; all numbers come from committed, tagged, reproducible runs.
Happy to run variations (different drivers, workgroup sizes, more workloads) if useful.