llama : clear error when MTP draft shares KV cache across backends by liminfei-amd · Pull Request #25232 · ggml-org/llama.cpp

liminfei-amd · 2026-07-02T09:51:37Z

Summary

Turn the opaque abort reported in #24492 into a clear, actionable error.

When MTP speculative decoding shares the target model's KV cache with the
draft, the draft KV layers reuse the target's pre-allocated K/V tensors,
which live on the target's buffer. On a multi-backend build (e.g. HIP +
Vulkan) the draft model can be auto-assigned to a different backend than
the target, so the shared tensor cannot be scheduled on the draft's backend.
ggml then aborts deep in the scheduler:

ggml-backend.cpp:898: pre-allocated tensor (cache_k_lNN) in a buffer (Vulkan0) that cannot run the operation (NONE)

There is no hint that speculative decoding or a backend mismatch is the
cause, so the failure is very hard to diagnose (see #24492).

Root cause

llama_kv_cache can share layers with a source cache (share && other); this
is how the MTP draft reuses the target's KV cache. The shared tensors
(layer_share.k/v) are pre-allocated on the source (target) buffer. If the
current (draft) layer is assigned to a different backend (model.dev_layer(il)),
the shared tensor's buffer type and the layer's buffer type differ, and the
scheduler later fails to place the tensor and aborts.
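
To make the condition concrete, here is a minimal, self-contained sketch of the mismatch described above. It is not llama.cpp code: `shared_tensor`, `draft_layer`, and `first_cross_backend_layer` are hypothetical stand-ins, and buffer types are modelled as plain strings rather than ggml buffer-type handles.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not llama.cpp types.
struct shared_tensor {
    std::string buft_name;          // buffer type the source (target) cache allocated the tensor on
};

struct draft_layer {
    int                   il;       // layer index
    std::string           dev_buft; // buffer type of the backend this layer was assigned to
    const shared_tensor * shared_k; // non-null when the layer reuses the target's K tensor
};

// Returns the index of the first layer whose shared tensor lives on a different
// buffer type than the layer's own backend, or -1 if every sharing layer agrees.
static int first_cross_backend_layer(const std::vector<draft_layer> & layers) {
    for (const auto & l : layers) {
        if (l.shared_k != nullptr && l.shared_k->buft_name != l.dev_buft) {
            return l.il;
        }
    }
    return -1;
}
```

With a target tensor on `Vulkan0`, a draft layer assigned to `ROCm1` is reported as the offending layer, while a draft pinned to `Vulkan0` yields -1.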

Fix

Detect the mismatch at KV-cache construction, at the sharing point, and throw
an actionable std::runtime_error (caught by llama_init_from_model, which
returns cleanly) instead of aborting:

llama_kv_cache: layer 3 shares a KV cache tensor allocated on Vulkan0 but is assigned to ROCm1. MTP speculative decoding cannot share the target KV cache across different backends. Pin the draft model to the target device with --spec-draft-device (matching --device), or use a single-backend build. See https://github.com/ggml-org/llama.cpp/issues/24492

The guard only triggers on the broken cross-backend configuration; the clean
single-backend / pinned path is unchanged.
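
The throw-then-return-cleanly pattern can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `kv_cache_sketch` and `init_sketch` are hypothetical names, buffer types are plain strings, and the message is shortened to the first sentence of the error quoted above.

```cpp
#include <cstdio>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the check at the sharing point; not llama.cpp code.
struct kv_cache_sketch {
    kv_cache_sketch(int il, const std::string & shared_buft, const std::string & layer_buft) {
        // Fail at construction, before any graph is scheduled.
        if (shared_buft != layer_buft) {
            throw std::runtime_error(
                "llama_kv_cache: layer " + std::to_string(il) +
                " shares a KV cache tensor allocated on " + shared_buft +
                " but is assigned to " + layer_buft);
        }
    }
};

// Hypothetical init wrapper: converts the exception into a null result so the
// caller can exit with an error code instead of the process aborting.
static std::unique_ptr<kv_cache_sketch> init_sketch(
        int il, const std::string & shared_buft, const std::string & layer_buft) {
    try {
        return std::make_unique<kv_cache_sketch>(il, shared_buft, layer_buft);
    } catch (const std::exception & e) {
        std::fprintf(stderr, "failed to initialize the context: %s\n", e.what());
        return nullptr;
    }
}
```

The design point is that the exception is raised while the cache is being built, where the layer index and both backend names are still known, rather than later in the scheduler where only the tensor name and buffer are visible.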

Testing

Dual-backend build (-DGGML_HIP=ON -DGGML_VULKAN=ON), MTP draft-mtp model
(gemma-style, shared KV), llama-server:

| Scenario | Before | After |
| --- | --- | --- |
| `--device Vulkan0` (no draft pin) | `GGML_ABORT` at ggml-backend.cpp:898, `SIGABRT` (exit 134) | Clear error naming `Vulkan0` vs `ROCm1` + `--spec-draft-device` hint, clean exit 1 |
| `--device Vulkan0 --spec-draft-device Vulkan0` | works | works (no false positive) |

Notes / scope

This does not enable cross-backend KV sharing (still unsupported); it makes
the misconfiguration diagnosable and points at the working --spec-draft-device
workaround, which is what #24492 asked for. The condition is architecture- and
vendor-independent (the same abort has been reported on other backends when the
draft lands on a different device than the target).

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: an AI coding assistant helped investigate and reproduce this on real hardware; I authored and reviewed the change and this description, and I can explain and take responsibility for every line.

Ref #24492

With MTP speculative decoding, the draft context can share the target model's KV cache (the draft KV layers reuse the target's pre-allocated K/V tensors). Those shared tensors live on the target model's buffer. On a multi-backend build (for example HIP + Vulkan), the draft model may be auto-assigned to a different backend than the target, so the shared tensor cannot be scheduled on the draft's backend.

ggml then aborts deep inside the scheduler with an opaque message and no hint about the cause:

ggml-backend.cpp:898: pre-allocated tensor (cache_k_lNN) in a buffer (X) that cannot run the operation (NONE)

Detect this cross-backend sharing at KV-cache construction and fail with an actionable std::runtime_error that names the mismatched backends and points to the --spec-draft-device workaround, instead of aborting the process. The clean single-backend / pinned-device path is unchanged.

Ref: ggml-org#24492

Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>

liminfei-amd requested a review from ggerganov as a code owner July 2, 2026 09:51

liminfei-amd mentioned this pull request Jul 2, 2026

Eval bug: Gemma 4 31B MTP (draft-mtp) crashes on Vulkan backend, pre-allocated tensor cannot run operation NONE #24492

Open

liminfei-amd wants to merge 1 commit into ggml-org:master from liminfei-amd:amd-rocm/24492-mtp-crossbackend-kv-clear-error
