Optimize RWKV7 inference by fusing some graph operators #25206

Draft
MollySophia wants to merge 14 commits into ggml-org:master from MollySophia:opt-rwkv

@MollySophia MollySophia commented Jul 1, 2026

Collaborator

Summary

This PR optimizes RWKV7 inference by reducing graph-level operator overhead
across CUDA and Vulkan, with CPU fallback coverage for the new RWKV-specific
operators.

Although RWKV is architecturally more linear/recurrent than Qwen3.5-style
hybrid models, the current ggml graph for RWKV7 expands into many small
elementwise and reduction operators around WKV. In decode, these small ops
create significant launch/dispatch overhead. This PR makes the main RWKV7 hot
paths explicit ggml ops so backends do not need to rediscover RWKV semantics
through fragile graph-pattern fusion.
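
To illustrate why many small operators are costly, the sketch below contrasts an unfused and a fused evaluation of the time-mix lerp expression listed under Changes. It is a plain C++ reference under stated assumptions, not the PR's kernels: the function names `lerp_unfused` and `lerp_fused` are hypothetical, and the inputs are assumed to be contiguous F32 arrays of equal length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused form: three elementwise passes and two temporaries, mirroring a
// graph built from separate sub, mul and add nodes. On a GPU backend each
// pass would typically be its own kernel launch or dispatch.
std::vector<float> lerp_unfused(const std::vector<float> & cur,
                                const std::vector<float> & x_prev,
                                const std::vector<float> & weight) {
    const std::size_t n = cur.size();
    assert(x_prev.size() == n && weight.size() == n);
    std::vector<float> diff(n), scaled(n), out(n);
    for (std::size_t i = 0; i < n; ++i) diff[i]   = x_prev[i] - cur[i];
    for (std::size_t i = 0; i < n; ++i) scaled[i] = diff[i] * weight[i];
    for (std::size_t i = 0; i < n; ++i) out[i]    = cur[i] + scaled[i];
    return out;
}

// Fused form: one pass, no temporaries. This is the shape of work a single
// explicit op can hand to a backend.
std::vector<float> lerp_fused(const std::vector<float> & cur,
                              const std::vector<float> & x_prev,
                              const std::vector<float> & weight) {
    const std::size_t n = cur.size();
    assert(x_prev.size() == n && weight.size() == n);
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = cur[i] + (x_prev[i] - cur[i]) * weight[i];
    }
    return out;
}
```

Both functions compute the same values; the fused one reads each input once and writes the output once, which is where the saving in launches and memory traffic comes from.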

Changes

  • Specialize RWKV7 WKV decode paths.

    • CUDA: optimized WKV7 decode kernel.
    • Vulkan: added specialized wkv7_t1 decode shader.
  • Add explicit RWKV time-mix lerp op:

    • RWKV_LERP: cur + (x_prev - cur) * weight
    • CUDA and Vulkan call the existing specialized kernels directly.
    • CPU has an F32 fallback with a contiguous RWKV fast path.
  • Fuse CUDA RWKV key-update pattern:

    • k + (a * ka - ka)
  • Add explicit RWKV7 r_k correction op:

    • RWKV_RK: cur + reshape(v * sum_rows((k * r) * r_k))
    • CUDA: added standalone fused kernel.
    • Vulkan: added standalone rwkv_rk.comp shader.
  • Work around a Vulkan CI correctness failure by disabling the specialized
    RWKV7 T=1 subgroup shader on the Intel proprietary driver on Windows. This
    is a separate commit from the RWKV op changes so it can be reverted
    independently once the driver-specific issue is resolved.
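
The key-update and r_k correction formulas in the list above can be written out as a scalar reference. This is a minimal sketch, not the code in this PR: `rwkv_key_update` and `rwkv_rk_ref` are hypothetical names, it handles a single token, and it assumes every tensor is a contiguous F32 array laid out as `[n_head][head_size]`, with `sum_rows` reducing over `head_size` within each head.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Key update: k + (a * ka - ka), elementwise.
std::vector<float> rwkv_key_update(const std::vector<float> & k,
                                   const std::vector<float> & a,
                                   const std::vector<float> & ka) {
    const std::size_t n = k.size();
    assert(a.size() == n && ka.size() == n);
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = k[i] + (a[i] * ka[i] - ka[i]);
    }
    return out;
}

// r_k correction: cur + reshape(v * sum_rows((k * r) * r_k)).
// Assumed layout: all inputs hold n_head * head_size floats, head-major.
std::vector<float> rwkv_rk_ref(const std::vector<float> & cur,
                               const std::vector<float> & k,
                               const std::vector<float> & r,
                               const std::vector<float> & v,
                               const std::vector<float> & r_k,
                               std::size_t head_size) {
    const std::size_t n = cur.size();
    assert(head_size > 0 && n % head_size == 0);
    assert(k.size() == n && r.size() == n && v.size() == n && r_k.size() == n);
    std::vector<float> out(n);
    for (std::size_t h = 0; h < n / head_size; ++h) {
        const std::size_t base = h * head_size;
        // sum_rows: one scalar per head.
        float s = 0.0f;
        for (std::size_t i = 0; i < head_size; ++i) {
            s += (k[base + i] * r[base + i]) * r_k[base + i];
        }
        // Broadcast the scalar over v, then add to cur.
        for (std::size_t i = 0; i < head_size; ++i) {
            out[base + i] = cur[base + i] + v[base + i] * s;
        }
    }
    return out;
}
```

A reference of this kind is mainly useful for checking a fused backend kernel against a straightforward implementation on small inputs.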

Performance

Hardware:

  • GPU: NVIDIA GeForce RTX 5090
  • CPU: AMD Ryzen 9 9950X
  • CPU thread count: 16
  • Baseline commit: 0eca4d490

RWKV7 1.5B

GPU model: rwkv7-g1g-1.5b-20260526-ctx8192-FP16.gguf

CPU model: rwkv7-g1g-1.5b-20260526-ctx8192-q4_0.gguf

Prefill

| Backend | Model | Test | Baseline (t/s) | Current (t/s) | Delta |
| --- | --- | --- | --- | --- | --- |
| CUDA | RWKV7 1.5B F16 | pp512 | 23092.43 +/- 1706.81 | 25148.17 +/- 2650.54 | +8.9% |
| Vulkan | RWKV7 1.5B F16 | pp512 | 16885.02 +/- 37.00 | 20720.94 +/- 47.62 | +22.7% |
| CPU 16t | RWKV7 1.5B Q4_0 | pp512 | 1112.15 +/- 2.48 | 1192.53 +/- 7.59 | +7.2% |

Decode

| Backend | Model | Test | Baseline (t/s) | Current (t/s) | Delta |
| --- | --- | --- | --- | --- | --- |
| CUDA | RWKV7 1.5B F16 | tg128 | 282.29 +/- 1.52 | 322.79 +/- 1.29 | +14.3% |
| Vulkan | RWKV7 1.5B F16 | tg128 | 244.56 +/- 0.60 | 287.48 +/- 0.97 | +17.5% |
| CPU 16t | RWKV7 1.5B Q4_0 | tg128 | 64.47 +/- 0.51 | 65.54 +/- 0.51 | +1.7% |

RWKV7 7.2B

Model: rwkv7-g1g-7.2b-20260523-ctx8192-F16.gguf

Prefill

| Backend | Model | Test | Baseline (t/s) | Current (t/s) | Delta |
| --- | --- | --- | --- | --- | --- |
| CUDA | RWKV7 7.2B F16 | pp512 | 8190.77 +/- 562.32 | 9336.82 +/- 648.37 | +14.0% |
| Vulkan | RWKV7 7.2B F16 | pp512 | 7289.55 +/- 11.08 | 8464.32 +/- 5.15 | +16.1% |

Decode

| Backend | Model | Test | Baseline (t/s) | Current (t/s) | Delta |
| --- | --- | --- | --- | --- | --- |
| CUDA | RWKV7 7.2B F16 | tg128 | 90.32 +/- 0.19 | 95.76 +/- 0.13 | +6.0% |
| CUDA | RWKV7 7.2B F16 | tg512 | 90.51 +/- 0.03 | 95.91 +/- 0.03 | +6.0% |
| Vulkan | RWKV7 7.2B F16 | tg128 | 83.84 +/- 0.01 | 90.10 +/- 0.02 | +7.5% |
| Vulkan | RWKV7 7.2B F16 | tg512 | 83.95 +/- 0.05 | 90.26 +/- 0.04 | +7.5% |
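
For readers who want to reproduce the Delta column, the sketch below shows the arithmetic. It is an illustration only: the `Measurement` struct and both function names are hypothetical, and it assumes the `+/-` figures are standard deviations around a mean throughput, which the PR text does not state explicitly.

```cpp
#include <cassert>
#include <cmath>

// One table cell: mean throughput and its reported +/- spread.
// Treating the spread as a standard deviation is an assumption.
struct Measurement {
    double mean;
    double spread;
};

// Relative change of `cur` over `base`, in percent.
double delta_percent(const Measurement & base, const Measurement & cur) {
    assert(base.mean > 0.0);
    return (cur.mean / base.mean - 1.0) * 100.0;
}

// True when the [mean - spread, mean + spread] ranges of the two
// measurements overlap, a rough hint that the difference may be noise.
bool ranges_overlap(const Measurement & base, const Measurement & cur) {
    return base.mean + base.spread >= cur.mean - cur.spread &&
           cur.mean + cur.spread >= base.mean - base.spread;
}
```

Applied to the tables, the two CUDA pp512 rows have overlapping ranges because of their large spreads, while none of the decode rows do.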

Validation

  • Backend op smoke tests:
    • test-backend-ops test -o RWKV_WKV7 -b CUDA0 -j 1
    • test-backend-ops test -o RWKV_WKV7 -b Vulkan0 -j 1
  • WikiText-2 perplexity is unchanged within noise:
    • 1.5B F16: 9.5497 -> 9.5503
    • 7.2B F16: 6.4011 -> 6.4013

Notes

The largest gains come from reducing the number of small graph operators and
backend launches around RWKV decode. On larger models, decode gains are smaller
because mat-vec work dominates more strongly, but both CUDA and Vulkan still
show consistent improvements.

TODO

  • [ ] SYCL/Metal fused kernels

@MollySophia MollySophia requested review from a team, CISC and ggerganov as code owners July 1, 2026 16:41
@MollySophia MollySophia marked this pull request as draft July 1, 2026 16:41
ggml-gh-bot Bot commented Jul 1, 2026


Hi @MollySophia, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple backend changes in one PR: When adding support for a new model or feature, focus on CPU support only in the initial PR. Add support for other backends like CUDA in follow-up PRs. If you have a good reason to modify multiple backends in one PR, please explain it.

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@github-actions github-actions Bot added the labels model, testing, Vulkan, ggml, SYCL, Apple Metal, and CUDA on Jul 1, 2026
@MollySophia MollySophia removed request for CISC and ggerganov July 1, 2026 16:52
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Comment thread ggml/src/ggml-cuda/fused-ops.cuh
Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated
@MollySophia
Collaborator Author

> Hi @MollySophia, thanks for your contribution!
>
> Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
>
>   • Multiple backend changes in one PR: When adding support for a new model or feature, focus on CPU support only in the initial PR. Add support for other backends like CUDA in follow-up PRs. If you have a good reason to modify multiple backends in one PR, please explain it.
>   • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.
>
> Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Regarding the PR flags:

1 & 2: This PR changes the existing RWKV7 path rather than adding support for a new model or feature. The WKV7 op has some semantic changes, so all supported backends need to be modified at once.
