Optimize RWKV7 inference by fusing some graph operators #25206
MollySophia wants to merge 14 commits into
Hi @MollySophia, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Regarding PR flags 1 & 2: this PR changes the existing RWKV7 path rather than adding support for a new model or feature. The semantics of the WKV7 op change, so all supported backends have to be modified at once.
Summary
This PR optimizes RWKV7 inference by reducing graph-level operator overhead
across CUDA and Vulkan, with CPU fallback coverage for the new RWKV-specific
operators.
Although RWKV is architecturally more linear/recurrent than Qwen3.5-style
hybrid models, the current ggml graph for RWKV7 expands into many small
elementwise and reduction operators around WKV. In decode, these small ops
create significant launch/dispatch overhead. This PR makes the main RWKV7 hot
paths explicit ggml ops so backends do not need to rediscover RWKV semantics
through fragile graph-pattern fusion.
Changes
- Specialize RWKV7 WKV decode paths.
  - wkv7_t1 decode shader.
- Add explicit RWKV time-mix lerp op:
  - RWKV_LERP: cur + (x_prev - cur) * weight
- Fuse CUDA RWKV key-update pattern:
  - k + (a * ka - ka)
- Add explicit RWKV7 r_k correction op:
  - RWKV_RK: cur + reshape(v * sum_rows((k * r) * r_k))
  - rwkv_rk.comp shader.
- Work around a Vulkan CI correctness failure by disabling the specialized
  RWKV7 T=1 subgroup shader on Intel's proprietary Windows driver. This is a
  separate commit from the RWKV op changes so it can be reverted independently
  if the driver-specific issue is resolved.
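The three formulas listed above can be read as plain elementwise and per-head reductions. The following is a minimal reference sketch in pure Python, not the ggml, CUDA, or Vulkan implementation from this PR. The function names, the flat list layout, and the interpretation of `sum_rows` as a per-head reduction over `head_size` contiguous values are assumptions made for illustration.

```python
# Reference semantics only. Names and tensor layout are illustrative
# assumptions, not the actual ggml op signatures.

def rwkv_lerp(cur, x_prev, weight):
    """Elementwise cur + (x_prev - cur) * weight."""
    return [c + (p - c) * w for c, p, w in zip(cur, x_prev, weight)]


def key_update(k, a, ka):
    """Elementwise k + (a * ka - ka), evaluated in a single pass."""
    return [ki + (ai * kai - kai) for ki, ai, kai in zip(k, a, ka)]


def rwkv_rk(cur, v, k, r, r_k, head_size):
    """cur + reshape(v * sum_rows((k * r) * r_k)).

    Assumed layout: every argument is a flat vector of n_head * head_size
    values for one token. Each head's row of (k * r) * r_k is summed to one
    scalar, which then scales that head's slice of v.
    """
    out = list(cur)
    for start in range(0, len(cur), head_size):
        end = start + head_size
        row_sum = sum(k[i] * r[i] * r_k[i] for i in range(start, end))
        for i in range(start, end):
            out[i] += v[i] * row_sum
    return out


if __name__ == "__main__":
    print(rwkv_lerp([1.0, 2.0], [3.0, 6.0], [0.0, 0.5]))
    print(key_update([1.0], [0.5], [2.0]))
    print(rwkv_rk([0.0] * 4, [1.0, 2.0, 3.0, 4.0], [1.0] * 4,
                  [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 0.5, 0.5], head_size=2))
```

Written as separate graph nodes, each of these expressions needs several small elementwise or reduction ops; a single explicit op lets a backend evaluate the whole expression in one launch.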
Performance
Hardware:
0eca4d490

RWKV7 1.5B
GPU model: rwkv7-g1g-1.5b-20260526-ctx8192-FP16.gguf
CPU model: rwkv7-g1g-1.5b-20260526-ctx8192-q4_0.gguf
[Table: Prefill, pp512]
[Table: Decode, tg128]

RWKV7 7.2B
Model: rwkv7-g1g-7.2b-20260523-ctx8192-F16.gguf
[Table: Prefill, pp512]
[Table: Decode, tg128 and tg512]

Validation
test-backend-ops test -o RWKV_WKV7 -b CUDA0 -j 1
test-backend-ops test -o RWKV_WKV7 -b Vulkan0 -j 1
9.5497 -> 9.5503
6.4011 -> 6.4013

Notes
The largest gains come from reducing small graph operators and backend launches
around RWKV decode. On larger models, decode gains are smaller because mat-vec
work dominates more strongly, but both CUDA and Vulkan still show consistent
positive improvements.
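The scaling argument above can be made concrete with an Amdahl-style estimate: if only the small-op share of per-token decode time gets faster, the overall gain shrinks as mat-vec work takes a larger share. The sketch below is illustrative only. The function name, the fractions, and the 2x small-op speedup are hypothetical values, not measurements from this PR.

```python
def decode_speedup(small_op_fraction, small_op_speedup):
    """Amdahl-style estimate of the overall decode speedup.

    small_op_fraction: share of per-token time spent in small graph ops
                       before fusion (hypothetical input).
    small_op_speedup:  factor by which fusion speeds up that share
                       (hypothetical input).
    """
    remaining = (1.0 - small_op_fraction) + small_op_fraction / small_op_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical shares: small ops weigh more on a smaller model, where
    # mat-vec work is a smaller part of each decode step.
    for label, fraction in [("smaller model", 0.40), ("larger model", 0.10)]:
        print(label, round(decode_speedup(fraction, 2.0), 3))
```

With the same small-op speedup, the model with the lower small-op share sees the smaller end-to-end gain, which matches the trend described in these notes.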
TODO