Optimize RWKV7 inference by fusing some graph operators by MollySophia · Pull Request #25206 · ggml-org/llama.cpp

MollySophia · 2026-07-01T16:41:15Z

Summary

This PR optimizes RWKV7 inference by reducing graph-level operator overhead
across CUDA and Vulkan, with CPU fallback coverage for the new RWKV-specific
operators.

Although RWKV is architecturally more linear/recurrent than Qwen3.5-style
hybrid models, the current ggml graph for RWKV7 expands into many small
elementwise and reduction operators around WKV. In decode, these small ops
create significant launch/dispatch overhead. This PR makes the main RWKV7 hot
paths explicit ggml ops so backends do not need to rediscover RWKV semantics
through fragile graph-pattern fusion.

Changes

Specialize RWKV7 WKV decode paths:
- CUDA: optimized WKV7 decode kernel.
- Vulkan: added specialized wkv7_t1 decode shader.
Add explicit RWKV time-mix lerp op:
- RWKV_LERP: cur + (x_prev - cur) * weight
- CUDA and Vulkan call the existing specialized kernels directly.
- CPU has an F32 fallback with a contiguous RWKV fast path.
Fuse CUDA RWKV key-update pattern:
- k + (a * ka - ka)
Add explicit RWKV7 r_k correction op:
- RWKV_RK: cur + reshape(v * sum_rows((k * r) * r_k))
- CUDA: added standalone fused kernel.
- Vulkan: added standalone rwkv_rk.comp shader.
Work around a Vulkan CI correctness failure by disabling the specialized
RWKV7 T=1 subgroup shader on the Intel proprietary Windows driver. This is a
separate commit from the RWKV op changes so it can be reverted independently
once the driver-specific issue is resolved.
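
For readers who want to check the three expressions listed above, here is a minimal reference sketch in plain Python. It is not the ggml, CUDA, or Vulkan implementation. The function names, the per-head list layout, and the reading of sum_rows as a reduction over each head's elements are assumptions made for illustration only.

```python
# Reference semantics for the RWKV expressions listed in the Changes section.
# Function names and data layout are hypothetical; this is not ggml code.

def rwkv_lerp(cur, x_prev, weight):
    """RWKV_LERP: cur + (x_prev - cur) * weight, elementwise."""
    return [c + (p - c) * w for c, p, w in zip(cur, x_prev, weight)]


def key_update(k, a, ka):
    """Key-update pattern: k + (a * ka - ka), elementwise."""
    return [ki + (ai * kai - kai) for ki, ai, kai in zip(k, a, ka)]


def rwkv_rk(cur, r, k, v, r_k):
    """RWKV_RK for one token: cur + reshape(v * sum_rows((k * r) * r_k)).

    Assumed layout: r, k, v, r_k are lists of heads, each head a list of
    head_size floats; cur is a flat list of n_head * head_size floats.
    """
    out = []
    for h, (rh, kh, vh, rkh) in enumerate(zip(r, k, v, r_k)):
        # sum_rows (assumed): reduce (k * r) * r_k over the head to one scalar.
        s = sum(ki * ri * rki for ki, ri, rki in zip(kh, rh, rkh))
        base = h * len(vh)
        # Scale v by the per-head scalar and add it to the flat residual.
        out.extend(cur[base + i] + vh[i] * s for i in range(len(vh)))
    return out
```

With weight 0 the lerp returns cur unchanged, and with weight 1 it returns x_prev. The key-update expression is algebraically the same as k + ka * (a - 1), which is one reason it is a natural candidate for a single fused elementwise kernel.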

Performance

Hardware:

GPU: NVIDIA GeForce RTX 5090
CPU: AMD Ryzen 9 9950X
CPU thread count: 16
Baseline commit: 0eca4d490

RWKV7 1.5B

GPU model: rwkv7-g1g-1.5b-20260526-ctx8192-FP16.gguf

CPU model: rwkv7-g1g-1.5b-20260526-ctx8192-q4_0.gguf

Prefill

Backend	Model	Test	Baseline (t/s)	Current (t/s)	Delta
CUDA	RWKV7 1.5B F16	`pp512`	23092.43 +/- 1706.81	25148.17 +/- 2650.54	+8.9%
Vulkan	RWKV7 1.5B F16	`pp512`	16885.02 +/- 37.00	20720.94 +/- 47.62	+22.7%
CPU 16t	RWKV7 1.5B Q4_0	`pp512`	1112.15 +/- 2.48	1192.53 +/- 7.59	+7.2%
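
The Delta column can be reproduced from the two throughput columns. The sketch below recomputes it for the 1.5B prefill rows and also checks whether the reported +/- ranges intersect. It assumes the +/- value is a symmetric spread around the mean, which the PR does not state, and the helper names are hypothetical.

```python
# Recompute the Delta column and check whether the reported +/- ranges overlap.
# Values are copied from the 1.5B prefill table; helper names are hypothetical.

def delta_percent(baseline, current):
    """Relative throughput change in percent."""
    return (current / baseline - 1.0) * 100.0


def ranges_overlap(base_mean, base_err, cur_mean, cur_err):
    """True if [mean - err, mean + err] of the two runs intersect."""
    return (base_mean + base_err >= cur_mean - cur_err
            and cur_mean + cur_err >= base_mean - base_err)


rows = {
    "CUDA":    (23092.43, 1706.81, 25148.17, 2650.54),
    "Vulkan":  (16885.02, 37.00, 20720.94, 47.62),
    "CPU 16t": (1112.15, 2.48, 1192.53, 7.59),
}

for backend, (b, be, c, ce) in rows.items():
    print(f"{backend}: {delta_percent(b, c):+.1f}%, "
          f"ranges overlap: {ranges_overlap(b, be, c, ce)}")
```

Under that assumption, the CUDA prefill ranges overlap while the Vulkan and CPU ranges do not, so the CUDA prefill gain carries more measurement uncertainty than the other two rows.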

Decode

Backend	Model	Test	Baseline (t/s)	Current (t/s)	Delta
CUDA	RWKV7 1.5B F16	`tg128`	282.29 +/- 1.52	322.79 +/- 1.29	+14.3%
Vulkan	RWKV7 1.5B F16	`tg128`	244.56 +/- 0.60	287.48 +/- 0.97	+17.5%
CPU 16t	RWKV7 1.5B Q4_0	`tg128`	64.47 +/- 0.51	65.54 +/- 0.51	+1.7%

RWKV7 7.2B

Model: rwkv7-g1g-7.2b-20260523-ctx8192-F16.gguf

Prefill

Backend	Model	Test	Baseline (t/s)	Current (t/s)	Delta
CUDA	RWKV7 7.2B F16	`pp512`	8190.77 +/- 562.32	9336.82 +/- 648.37	+14.0%
Vulkan	RWKV7 7.2B F16	`pp512`	7289.55 +/- 11.08	8464.32 +/- 5.15	+16.1%

Decode

Backend	Model	Test	Baseline (t/s)	Current (t/s)	Delta
CUDA	RWKV7 7.2B F16	`tg128`	90.32 +/- 0.19	95.76 +/- 0.13	+6.0%
CUDA	RWKV7 7.2B F16	`tg512`	90.51 +/- 0.03	95.91 +/- 0.03	+6.0%
Vulkan	RWKV7 7.2B F16	`tg128`	83.84 +/- 0.01	90.10 +/- 0.02	+7.5%
Vulkan	RWKV7 7.2B F16	`tg512`	83.95 +/- 0.05	90.26 +/- 0.04	+7.5%

Validation

Backend op smoke tests:
- test-backend-ops test -o RWKV_WKV7 -b CUDA0 -j 1
- test-backend-ops test -o RWKV_WKV7 -b Vulkan0 -j 1
WikiText-2 perplexity is unchanged within noise:
- 1.5B F16: 9.5497 -> 9.5503
- 7.2B F16: 6.4011 -> 6.4013
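
To put "within noise" in numbers, the relative change of the two reported perplexity values can be computed directly. This is only arithmetic on the figures above; the helper name is hypothetical.

```python
# Relative perplexity change for the two before/after pairs reported above.

def rel_change_percent(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


for name, before, after in [("1.5B F16", 9.5497, 9.5503),
                            ("7.2B F16", 6.4011, 6.4013)]:
    print(f"{name}: {rel_change_percent(before, after):+.4f}%")
```

Both changes come out below one hundredth of a percent.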

Notes

The largest gains come from reducing small graph operators and backend launches
around RWKV decode. On larger models, decode gains are smaller because mat-vec
work dominates more strongly, but both CUDA and Vulkan still show consistent
positive improvements.
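
One way to read the note above is to convert decode throughput into per-token latency. The sketch below does that for the tg128 rows, assuming the table values are tokens per second; the helper name is hypothetical.

```python
# Convert decode throughput (tokens/s) into per-token latency and compare savings.
# Throughput values are copied from the tg128 rows of the decode tables.

def ms_per_token(tokens_per_second):
    """Average time spent per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


runs = {
    "CUDA 1.5B":   (282.29, 322.79),
    "CUDA 7.2B":   (90.32, 95.76),
    "Vulkan 1.5B": (244.56, 287.48),
    "Vulkan 7.2B": (83.84, 90.10),
}

for name, (baseline, current) in runs.items():
    before, after = ms_per_token(baseline), ms_per_token(current)
    saved = before - after
    print(f"{name}: {before:.2f} ms -> {after:.2f} ms, "
          f"saved {saved:.2f} ms ({saved / before * 100:.1f}%)")
```

On these figures the absolute time saved per token is of similar order for both model sizes (roughly 0.4 to 0.8 ms), while the total per-token time is about three times larger for 7.2B. That is consistent with the explanation that the remaining mat-vec work makes the same kind of saving a smaller fraction of the total.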

TODO

[ ] SYCL/Metal fused kernels

ggml-gh-bot · 2026-07-01T16:45:44Z

Hi @MollySophia, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple backend changes in one PR: When adding support for a new model or feature, focus on CPU support only in the initial PR. Add support for other backends like CUDA in follow-up PRs. If you have a good reason to modify multiple backends in one PR, please explain it.
Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

MollySophia · 2026-07-02T14:15:15Z

Regarding the PR checker flags:

1 & 2: This PR changes the existing RWKV7 path rather than adding support for a new model or feature. The WKV7 op has some semantic changes, so all supported backends need to be modified at once.

MollySophia added 8 commits July 1, 2026 14:06

rwkv: fuse wkv7 graph operations

e533e30

ggml: fuse norm affine operations

bf4320f

vulkan: fuse norm affine operations

a47b818

ggml: fuse add-mul elementwise operations

c88673f

cuda: specialize rwkv lerp fusion

6bf3c2b

cuda: specialize rwkv7 wkv decode

f5d7d00

vulkan: specialize rwkv7 decode and lerp

6a470d5

rwkv: fuse rk correction separately

402d7a7

MollySophia requested review from a team, CISC and ggerganov as code owners July 1, 2026 16:41

MollySophia marked this pull request as draft July 1, 2026 16:41

MollySophia removed request for CISC and ggerganov July 1, 2026 16:52

MollySophia added 2 commits July 2, 2026 00:59

some style changes

37ed79b

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

Reduce RWKV fusion scheduler intrusiveness

ca6d972

MollySophia commented Jul 2, 2026

Comment thread ggml/src/ggml-cuda/fused-ops.cuh

Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated

MollySophia added 2 commits July 2, 2026 21:20

rwkv: use explicit ops for lerp and rk correction

7029af7

vulkan: disable rwkv7 t1 shader on Intel Windows

bdbd216

MollySophia force-pushed the opt-rwkv branch from 9769a90 to bdbd216 Compare July 2, 2026 13:21

MollySophia added 2 commits July 2, 2026 21:52

tests: add rwkv lerp and rk backend ops

03d17d5

cuda: clarify elementwise fusion aliasing check

7839033

MollySophia wants to merge 14 commits into ggml-org:master from MollySophia:opt-rwkv