llama : clear error when MTP draft shares KV cache across backends #25232
Open
liminfei-amd wants to merge 1 commit into
With MTP speculative decoding, the draft context can share the target model's KV cache (the draft KV layers reuse the target's pre-allocated K/V tensors). Those shared tensors live on the target model's buffer. On a multi-backend build (for example HIP + Vulkan), the draft model may be auto-assigned to a different backend than the target, so the shared tensor cannot be scheduled on the draft's backend. ggml then aborts deep inside the scheduler with an opaque message and no hint about the cause:

    ggml-backend.cpp:898: pre-allocated tensor (cache_k_lNN) in a buffer (X) that cannot run the operation (NONE)

Detect this cross-backend sharing at KV-cache construction and fail with an actionable std::runtime_error that names the mismatched backends and points to the --spec-draft-device workaround, instead of aborting the process. The clean single-backend / pinned-device path is unchanged.

Ref: ggml-org#24492

Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
Summary
Turn the opaque abort reported in #24492 into a clear, actionable error.
When MTP speculative decoding shares the target model's KV cache with the
draft, the draft KV layers reuse the target's pre-allocated K/V tensors,
which live on the target's buffer. On a multi-backend build (e.g. HIP +
Vulkan) the draft model can be auto-assigned to a different backend than
the target, so the shared tensor cannot be scheduled on the draft's backend.
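ggml then aborts deep in the scheduler:

    ggml-backend.cpp:898: pre-allocated tensor (cache_k_lNN) in a buffer (X) that cannot run the operation (NONE)
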
There is no hint that speculative decoding or a backend mismatch is the
cause, so the failure is very hard to diagnose (see #24492).
Root cause
llama_kv_cache can share layers with a source cache (share && other); this
is how the MTP draft reuses the target's KV. The shared layer_share.k/v are
pre-allocated on the source (target) buffer. If the current (draft) layer is
assigned to a different backend (model.dev_layer(il)), the shared tensor's
buffer type and the layer's buffer type differ, and the scheduler later fails
to place the tensor and aborts.
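
The condition described above can be sketched in isolation. This is a minimal illustration, not the llama.cpp code: `shared_layer` and `mismatched_layers` are hypothetical names, and a buffer type is modelled as a plain backend name string rather than a real ggml buffer type handle.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in, not a llama.cpp type: a buffer type is modelled as
// the name of the backend that owns it (e.g. "Vulkan0", "ROCm1").
struct shared_layer {
    int         il;           // layer index
    std::string tensor_buft;  // buffer type the shared K/V tensor was allocated on (target side)
    std::string layer_buft;   // buffer type of the device the draft layer was assigned to
};

// Return the indices of layers whose shared tensor lives on a different
// buffer type than the one the draft layer would run on.
std::vector<int> mismatched_layers(const std::vector<shared_layer> & layers) {
    std::vector<int> out;
    for (const auto & l : layers) {
        if (l.tensor_buft != l.layer_buft) {
            out.push_back(l.il);
        }
    }
    return out;
}
```

An empty result corresponds to the single-backend or pinned-device case; any non-empty result is the configuration that the scheduler cannot place.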
Fix
Detect the mismatch at KV-cache construction, at the sharing point, and throw
an actionable std::runtime_error (caught by llama_init_from_model, which
returns cleanly) instead of aborting. The error names the mismatched backends
and points to the --spec-draft-device workaround.
The guard only triggers on the broken cross-backend configuration; the clean
single-backend / pinned path is unchanged.
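
The throw-then-return-cleanly pattern can be sketched as follows. This is an assumption-laden illustration rather than the patch itself: `kv_cache_sketch` and `init_sketch` are hypothetical names standing in for the KV-cache constructor and the init function that catches the exception, backends are modelled as name strings, and the error wording is invented for the example.

```cpp
#include <cstdio>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the KV cache: the constructor refuses to share
// tensors across backends and reports both backend names.
struct kv_cache_sketch {
    kv_cache_sketch(const std::string & target_buft, const std::string & draft_buft) {
        if (target_buft != draft_buft) {
            // Illustrative wording only; not the message used in the PR.
            throw std::runtime_error(
                "draft shares the target KV cache, but the shared tensors are on " + target_buft +
                " and the draft layer is on " + draft_buft +
                "; pin the draft with --spec-draft-device");
        }
    }
};

// Hypothetical stand-in for the init function: it converts the exception into
// a logged error and a null result, so the caller can exit with a normal
// error code instead of the process aborting.
std::unique_ptr<kv_cache_sketch> init_sketch(const std::string & target_buft,
                                             const std::string & draft_buft) {
    try {
        return std::make_unique<kv_cache_sketch>(target_buft, draft_buft);
    } catch (const std::exception & err) {
        std::fprintf(stderr, "failed to initialize the context: %s\n", err.what());
        return nullptr;
    }
}
```

Throwing at construction keeps the check next to the point where sharing is decided, while the catch at the init boundary is what lets the caller see an ordinary failure rather than SIGABRT.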
Testing
Dual-backend build (-DGGML_HIP=ON -DGGML_VULKAN=ON), MTP draft-mtp model
(gemma-style, shared KV), llama-server:

| Configuration | Before | After |
| --- | --- | --- |
| --device Vulkan0 (no draft pin) | GGML_ABORT at ggml-backend.cpp:898, SIGABRT (exit 134) | Vulkan0 vs ROCm1 + --spec-draft-device hint, clean exit 1 |
| --device Vulkan0 --spec-draft-device Vulkan0 | | |

Notes / scope
This does not enable cross-backend KV sharing (still unsupported); it makes
the misconfiguration diagnosable and points at the working
--spec-draft-device workaround, which is what #24492 asked for. The condition is architecture- and
vendor-independent (the same abort has been reported on other backends when the
draft lands on a different device than the target).
Requirements
Ref #24492