Benchmark + analysis: rust-gpu vs hand-WGSL - parity on integer kernels, ~1.8x on a path tracer (CFG shape), bounds checks worth 2.5x #614

botBehavior · 2026-06-10T08:48:10Z

Hi! We built what we believe is the first published benchmark of rust-gpu-emitted SPIR-V against hand-written WGSL twins, plus a root-cause analysis of the gaps. Sharing because two findings seem directly useful to the project, and one is a genuinely good story for rust-gpu.

Repro repo (every number traces to a tagged commit; cold-clone tested): https://github.com/botBehavior/rustgpu-bench
Live demo (same Rust function on WebGPU + WASM): https://botbehavior.github.io/rustgpu-bench/
Possibly related: "Use VulkanShaderExamples rust-gpu port as benchmarks" (#315; benchmark infrastructure, happy to contribute these workloads/harness if useful) and "[Migrated] Inliner->mem2reg can quadratically amplify already-exponential inlining (turning seconds into minutes)" (#63; the inliner→mem2reg pathology, which is consistent with the CFG-shape finding below).

Setup

rust-gpu/spirv-std 0.10.0-alpha.1 (nightly-2026-04-11), wgpu/naga 29.0.3, RTX 5070 Ti (driver 32.0.15.9649), Windows 11, Vulkan backend.
Three workloads (Collatz 1M, naive matmul 1024³, iterative path tracer 800×450@32spp); same algorithms / workgroup sizes / buffers in both languages; hand-WGSL written fresh, not transpiled. Every arm correctness-gated against a CPU oracle before timing.
GPU timestamp queries around the compute pass only; medians of 30 after warmup; independently cross-checked with amortized wall-clock (all ratios reproduce).
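
For readers who want the "medians of 30 after warmup" step spelled out, here is a minimal sketch. It assumes a hypothetical `run_once` closure that returns the duration of one timed dispatch; the names `median_ms` and `bench` are illustrative and not taken from the repo's harness, and the stand-in workload in `main` measures CPU wall-clock time rather than GPU timestamp queries.

```rust
use std::time::{Duration, Instant};

/// Median of a non-empty set of samples, in milliseconds.
/// For an even count this averages the two middle values.
fn median_ms(samples: &mut [f64]) -> f64 {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort_by(|a, b| a.total_cmp(b));
    let mid = samples.len() / 2;
    if samples.len() % 2 == 0 {
        (samples[mid - 1] + samples[mid]) / 2.0
    } else {
        samples[mid]
    }
}

/// Hypothetical harness: `warmup` untimed iterations, then `runs` timed ones,
/// reduced to a single median in milliseconds.
fn bench<F: FnMut() -> Duration>(mut run_once: F, warmup: usize, runs: usize) -> f64 {
    for _ in 0..warmup {
        let _ = run_once(); // result discarded: lets clocks and caches settle
    }
    let mut samples: Vec<f64> = (0..runs)
        .map(|_| run_once().as_secs_f64() * 1e3)
        .collect();
    median_ms(&mut samples)
}

fn main() {
    // Stand-in for one timed GPU dispatch (assumption: a real harness would
    // return the timestamp-query delta for the compute pass instead).
    let fake_dispatch = || {
        let start = Instant::now();
        let mut acc = 0u64;
        for i in 0..100_000u64 {
            acc = acc.wrapping_add(std::hint::black_box(i));
        }
        std::hint::black_box(acc);
        start.elapsed()
    };
    let median = bench(fake_dispatch, 5, 30);
    println!("median over 30 runs: {median:.4} ms");
}
```

The median is used rather than the mean so that a single slow outlier run (for example a clock ramp or a background process) does not shift the reported number.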

Results (median ms)

| workload | rust-gpu (passthrough) | rust-gpu via naga frontend | hand-WGSL |
| --- | --- | --- | --- |
| collatz 1M | 0.186 | 0.347 | 0.197 |
| matmul 1024³ | 1.798 | 1.794 | 1.566 |
| matmul 1024³ `get_unchecked` | 0.696 | 1.668 | n/a |
| path tracer | 1.098 | 1.360 | 0.598 |
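
The headline ratios can be re-derived from the medians in the table. The sketch below only redoes that arithmetic; the `ratio` helper and the variable names are illustrative and not part of the repo.

```rust
/// How many times slower `slower_ms` is than `faster_ms`.
fn ratio(slower_ms: f64, faster_ms: f64) -> f64 {
    slower_ms / faster_ms
}

fn main() {
    // Medians in ms, copied from the results table.
    let collatz_rustgpu = 0.186;
    let collatz_wgsl = 0.197;
    let matmul_rustgpu = 1.798;
    let matmul_unchecked = 0.696;
    let matmul_wgsl = 1.566;
    let tracer_rustgpu = 1.098;
    let tracer_wgsl = 0.598;

    // Collatz: hand-WGSL is marginally slower than rust-gpu passthrough.
    println!("collatz, wgsl / rust-gpu:        {:.2}x", ratio(collatz_wgsl, collatz_rustgpu));
    // Bounds checks: checked vs unchecked rust-gpu matmul.
    println!("matmul, checked / unchecked:     {:.2}x", ratio(matmul_rustgpu, matmul_unchecked));
    // Unchecked rust-gpu matmul against the hand-WGSL twin.
    println!("matmul, wgsl / unchecked:        {:.2}x", ratio(matmul_wgsl, matmul_unchecked));
    // Path tracer: rust-gpu passthrough against hand-WGSL.
    println!("path tracer, rust-gpu / wgsl:    {:.2}x", ratio(tracer_rustgpu, tracer_wgsl));
}
```

With these inputs the four ratios come out at roughly 1.06×, 2.58×, 2.25×, and 1.84×.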

Findings

Branchy integer code: parity (rust-gpu slightly ahead on Collatz). The "equivalent SPIR-V ⇒ equivalent speed" expectation holds here.
Slice bounds checks are a 2.5× lever on hot loops: `get_unchecked` takes the matmul from 1.798 → 0.696 ms, which makes it about 2.25× faster than the hand-WGSL twin (1.566 ms); the twin cannot opt out of wgpu's clamp checks. (Through the naga frontend the win disappears because wgpu re-injects its runtime checks, so this is a native-passthrough advantage. It's also a nice rust-gpu pitch: Rust has an audited escape hatch, WGSL doesn't.)
The path-tracer gap (1.84×) is codegen shape, not bloat or math. Evidence:
- Instruction counts near-equal: rust-gpu render_cs 466 ops vs 426 total for the naga-compiled hand twin. Transcendentals are native OpExtInst (9 of them) — the "software libm" theory is wrong.
- Structure differs sharply: rust-gpu emits one flattened function, 74 blocks, 40 OpPhi (the inline-everything consequence of logical addressing), vs naga's 11 small structured functions, ~0 Phi (memory-form locals). The NVIDIA compiler clearly digests the latter better on this workload.
- Threading RNG state by value instead of &mut changed nothing (rust-gpu inlines by policy regardless), and the experimental qptr pipeline (--spirt-passes=qptr) builds + runs all our workloads correctly but is perf-neutral here, as expected.
So the actionable question for SPIR-T: is there room for Phi-reduction / block-layout / selective-outlining work targeting driver-compiler friendliness? The tracer is small and self-contained if you want it as a test case.
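
To make the bounds-check finding concrete, here is a minimal sketch of the two indexing styles on a naive row-major matmul. This is plain CPU-side Rust written for illustration, not the shader code from the repo; the function names and the up-front length assertion are assumptions of the sketch.

```rust
/// Naive row-major n×n matmul using ordinary indexing.
/// Every `a[..]`, `b[..]`, `c[..]` access carries a bounds check.
fn matmul_checked(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
    for row in 0..n {
        for col in 0..n {
            let mut acc = 0.0f32;
            for k in 0..n {
                acc += a[row * n + k] * b[k * n + col];
            }
            c[row * n + col] = acc;
        }
    }
}

/// Same algorithm with `get_unchecked` in the hot loop.
/// The lengths are validated once up front instead of on every access.
fn matmul_unchecked(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
    assert!(a.len() == n * n && b.len() == n * n && c.len() == n * n);
    for row in 0..n {
        for col in 0..n {
            let mut acc = 0.0f32;
            for k in 0..n {
                // SAFETY: row, col, k < n and each slice has length n * n
                // (asserted above), so every index is at most n * n - 1.
                acc += unsafe { *a.get_unchecked(row * n + k) * *b.get_unchecked(k * n + col) };
            }
            // SAFETY: same argument as above for the output index.
            unsafe { *c.get_unchecked_mut(row * n + col) = acc };
        }
    }
}

fn main() {
    let n = 2;
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let mut checked = [0.0f32; 4];
    let mut unchecked = [0.0f32; 4];
    matmul_checked(&a, &b, &mut checked, n);
    matmul_unchecked(&a, &b, &mut unchecked, n);
    assert_eq!(checked, unchecked);
    println!("{checked:?}");
}
```

The point of the pattern is that the safety argument lives in one audited place (the assertion plus the `SAFETY` comments), while the inner loop carries no per-access check. Whether that translates into the speedup reported above depends on the backend and on whether the shader reaches the driver via passthrough.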

Caveats

One GPU / one driver / one OS; alpha-toolchain snapshot; hand-WGSL idiomatic but not heroically tuned (the comparison targets codegen, not effort). Methodology details, raw JSON, and the opcode-stats tool are all in the repo.

Disclosure: this work was produced by an AI assistant (Claude) operating under human direction and review; all numbers come from committed, tagged, reproducible runs.

Happy to run variations (different drivers, workgroup sizes, more workloads) if useful.

Firestar99 (Maintainer) · 2026-06-10T09:51:42Z

First of all, thank you for making this, this is super interesting! I know @nazar-pc asked about some sort of benchmark comparing rust-gpu to other systems in the past.

Here are my results on a Strix Halo 8060S on Linux with (standard) RADV drivers:

adapter: Radeon 8060S Graphics (RADV GFX1151) (Vulkan)
collatz  rustgpu-spv   median    0.460 ms  (p25   0.437, p75   0.465)  module     0.0 ms  xcheck 0.648 ms; verified 1048576 exact
collatz  rustgpu-naga  median    0.580 ms  (p25   0.557, p75   0.598)  module     0.8 ms  xcheck 0.669 ms; verified 1048576 exact
collatz  hand-wgsl     median    0.423 ms  (p25   0.410, p75   0.435)  module     0.1 ms  xcheck 0.515 ms; verified 1048576 exact
matmul   rustgpu-spv   median    3.519 ms  (p25   3.455, p75   3.581)  module     0.0 ms  xcheck 3.803 ms; verified 1000 samples, worst rel 5.1e-6
matmul   rustgpu-naga  median    3.756 ms  (p25   3.659, p75   3.781)  module     0.8 ms  xcheck 4.475 ms; verified 1000 samples, worst rel 5.1e-6
matmul   hand-wgsl     median    2.797 ms  (p25   2.711, p75   2.891)  module     0.2 ms  xcheck 2.956 ms; verified 1000 samples, worst rel 5.1e-6
matmul_unchecked rustgpu-spv   median    2.898 ms  (p25   2.856, p75   2.928)  module     0.0 ms  xcheck 4.239 ms; verified 1000 samples, worst rel 5.1e-6
matmul_unchecked rustgpu-naga  median    3.068 ms  (p25   3.006, p75   3.100)  module     0.8 ms  xcheck 3.599 ms; verified 1000 samples, worst rel 5.1e-6
render   rustgpu-spv   median    1.447 ms  (p25   1.375, p75   1.514)  module     0.0 ms  xcheck 1.552 ms; verified, mean diff 1.1e-4
render   rustgpu-naga  median    4.238 ms  (p25   4.154, p75   4.351)  module     0.8 ms  xcheck 4.311 ms; verified, mean diff 3.4e-5
render   hand-wgsl     median    1.514 ms  (p25   1.462, p75   1.554)  module     0.9 ms  xcheck 1.499 ms; verified, mean diff 9.8e-5
render_v2 rustgpu-spv   median    1.481 ms  (p25   1.462, p75   1.519)  module     0.0 ms  xcheck 1.520 ms; verified, mean diff 1.1e-4
render_v2 rustgpu-naga  median    4.379 ms  (p25   4.312, p75   4.483)  module     0.8 ms  xcheck 4.249 ms; verified, mean diff 3.4e-5

Notice how your render (raymarcher of CSG) is just as fast with rustgpu-spv and hand-wgsl?

I'm also noticing that your runs are so cheap that they barely utilize my GPU, and your GPU is likely similarly just sitting mostly idle. Generally, GPUs are throughput optimized, so benchmarking their latency often doesn't really make much sense. Also note that you're measuring the runtime on the CPU, whereas you could also measure runtime on the GPU with Vulkan extensions (that aren't exposed in wgpu, so actually measuring that is hard). You're submitting work and then the CPU waits for the GPU work to complete, measuring that before starting the next run, during which your GPU sits idle waiting for work, which also isn't great for measuring throughput. Yeah, benchmarking GPUs can be a lot harder than CPUs...

Minor notes regarding rust-gpu setup:

> crates.io cargo-gpu is a name-reservation stub — prints "Coming Soon", exits 0.

`cargo install` uses the latest release, which is still the stub, but you can `cargo install cargo-gpu@0.10.0-alpha.1`. Thanks for reporting this; I may just make the next crates.io release a full release instead of an alpha to fix it.

> glam version trap: spirv-std 0.10.0-alpha.1 allows glam >= 0.30.8...

fixed with #613

> rustup self-update race can kill the first backend build on Windows.

It's not just Windows, and it's slightly more complicated than that. cargo-gpu currently doesn't have a lockfile and is racy. First it installs the Rust toolchain via rustup, which is also racy and may leave your toolchain installed but broken, requiring you to remove and reinstall it. I've seen it mess up installs when you terminate rustup during installation. There is also a race within cargo-gpu when building the SPIR-V codegen backend, which often shows up as it "failing to remove Cargo.lock". Usually, just waiting for the build to finish and retrying works, but I do agree a proper lockfile would be nicer.

> `cargo gpu build --shader-crate shaders --output-dir shaders/spv --auto-install-rust-toolchain`

You can also build the shaders within your build.rs build scripts, so you don't have to build them manually. Have a look at rust-gpu-template, or this example build script specifically. (Our main readme doesn't mention this repo yet; I want to update the docs this week.)

1 reply

Firestar99 Jun 10, 2026
Maintainer

For fun, I've rerun the render benches at 2048×2048, 32spp:

render   rustgpu-spv   median   25.157 ms  (p25  23.121, p75  28.916)  module     0.0 ms  xcheck 24.196 ms; verified, mean diff 1.0e-4
render   rustgpu-naga  median   51.336 ms  (p25  50.582, p75  54.795)  module     0.8 ms  xcheck 50.661 ms; verified, mean diff 3.3e-5
render   hand-wgsl     median   18.379 ms  (p25  17.870, p75  19.158)  module     1.5 ms  xcheck 18.447 ms; verified, mean diff 8.4e-5
render_v2 rustgpu-spv   median   18.220 ms  (p25  17.858, p75  19.045)  module     0.0 ms  xcheck 17.966 ms; verified, mean diff 1.0e-4
render_v2 rustgpu-naga  median   50.385 ms  (p25  49.713, p75  52.242)  module     0.8 ms  xcheck 50.349 ms; verified, mean diff 3.3e-5

Interesting to see naga transpilation crashing performance this much, and your v1 performing a decent bit worse than v2.

nazar-pc · 2026-06-10T10:07:31Z

> wgpu re-injects its runtime checks

Wait, WAT?

How do I get rid of it? I use get_unchecked() and such in well-designed and thoroughly audited code for a reason! But I can't use pass-through (at least unconditionally) since I want to target Metal backend too.

6 replies

nazar-pc Jun 10, 2026

Hm... is there a way to tell wgpu "trust me bro"? I know there is no oob in my code to begin with, so there is no need to worry about whether it is UB or not.

Firestar99 Jun 10, 2026
Maintainer

That's what pass-through is for :D

Now I'd be curious how much performance you'd gain by having this as an opt-in.

nazar-pc Jun 10, 2026

Man, that is really annoying. I don't want to have even more platform-specific code in my project. And I don't want to have bounds checks on Metal either, where I can't feed SPIR-V using pass-through.

Argghhh.......

Firestar99 Jun 10, 2026
Maintainer

What about using create_shader_module_unchecked() instead of create_shader_module()? I'd assume this also requires DeviceDescriptor.experimental_features = ExperimentalFeatures::enabled().

Internally, naga has a struct called BoundsCheckPolicies, which defines how to handle oob behaviour, including options for BoundsCheckPolicy::Unchecked. And pretty much every naga backend has the option to adjust bounds check insertion on emission, like the options for msl (metal). So you also could do the spv -> msl transpilation yourself with bounds checks disabled and pass it as a passthrough shader.

nazar-pc Jun 10, 2026

Nice, I'll give it a try then, thanks!

LegNeato (Maintainer) · 2026-06-10T17:03:40Z

@botBehavior based on what you are doing you might be interested in this (older) example and blog post: https://rust-gpu.github.io/blog/2025/07/25/rust-on-every-gpu/.
