llama : clear error when MTP draft shares KV cache across backends #25232

Open
liminfei-amd wants to merge 1 commit into ggml-org:master from liminfei-amd:amd-rocm/24492-mtp-crossbackend-kv-clear-error

@liminfei-amd

Contributor

Summary

Turn the opaque abort reported in #24492 into a clear, actionable error.

When MTP speculative decoding shares the target model's KV cache with the
draft, the draft KV layers reuse the target's pre-allocated K/V tensors,
which live on the target's buffer. On a multi-backend build (e.g. HIP +
Vulkan) the draft model can be auto-assigned to a different backend than
the target, so the shared tensor cannot be scheduled on the draft's backend.
ggml then aborts deep in the scheduler:

ggml-backend.cpp:898: pre-allocated tensor (cache_k_lNN) in a buffer (Vulkan0) that cannot run the operation (NONE)

There is no hint that speculative decoding or a backend mismatch is the
cause, so the failure is very hard to diagnose (see #24492).

Root cause

llama_kv_cache can share layers with a source cache (share && other) — this
is how the MTP draft reuses the target's KV. The shared layer_share.k/v are
pre-allocated on the source (target) buffer. If the current (draft) layer is
assigned to a different backend (model.dev_layer(il)), the shared tensor's
buffer type and the layer's buffer type differ, and the scheduler later fails
to place the tensor and aborts.

Fix

Detect the mismatch at KV-cache construction, at the sharing point, and throw
an actionable std::runtime_error (caught by llama_init_from_model, which
returns cleanly) instead of aborting:

llama_kv_cache: layer 3 shares a KV cache tensor allocated on Vulkan0 but is assigned to ROCm1. MTP speculative decoding cannot share the target KV cache across different backends. Pin the draft model to the target device with --spec-draft-device (matching --device), or use a single-backend build. See https://github.com/ggml-org/llama.cpp/issues/24492

The guard only triggers on the broken cross-backend configuration; the clean
single-backend / pinned path is unchanged.
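
For readers who want the shape of the check without opening the diff, here is a minimal, self-contained sketch. It is not the code from this PR: `kv_layer_sketch`, `check_shared_kv_backends`, and `init_context_sketch` are hypothetical names, backends are modeled as plain strings rather than ggml buffer types, and the message is shortened. It only illustrates the pattern described above: compare the shared tensor's backend with the layer's assigned backend at construction time, throw std::runtime_error on a mismatch, and have the caller turn the exception into a clean failure.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one draft KV layer; not the real llama.cpp type.
struct kv_layer_sketch {
    int         il;           // layer index
    std::string shared_buft;  // backend holding the K/V tensor shared from the target cache
    std::string layer_buft;   // backend this draft layer was assigned to
};

// Throws if any layer reuses a K/V tensor that lives on a different backend
// than the one the layer itself is assigned to.
inline void check_shared_kv_backends(const std::vector<kv_layer_sketch> & layers) {
    for (const auto & l : layers) {
        if (l.shared_buft != l.layer_buft) {
            throw std::runtime_error(
                "llama_kv_cache: layer " + std::to_string(l.il) +
                " shares a KV cache tensor allocated on " + l.shared_buft +
                " but is assigned to " + l.layer_buft +
                ". Pin the draft model to the target device with --spec-draft-device.");
        }
    }
}

// Hypothetical caller: converts the exception into a clean failure
// (returns false and reports the message) instead of aborting the process.
inline bool init_context_sketch(const std::vector<kv_layer_sketch> & layers, std::string & err) {
    try {
        check_shared_kv_backends(layers);
    } catch (const std::exception & e) {
        err = e.what();
        return false;
    }
    return true;
}
```

When every layer's two backends match, the check is a no-op, which mirrors the claim that the single-backend / pinned path is unchanged.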

Testing

Dual-backend build (-DGGML_HIP=ON -DGGML_VULKAN=ON), MTP draft-mtp model
(gemma-style, shared KV), llama-server:

| Scenario | Before | After |
| --- | --- | --- |
| --device Vulkan0 (no draft pin) | GGML_ABORT at ggml-backend.cpp:898, SIGABRT (exit 134) | Clear error naming Vulkan0 vs ROCm1 + --spec-draft-device hint, clean exit 1 |
| --device Vulkan0 --spec-draft-device Vulkan0 | works | works (no false positive) |

Notes / scope

This does not enable cross-backend KV sharing (still unsupported); it makes
the misconfiguration diagnosable and points at the working --spec-draft-device
workaround, which is what #24492 asked for. The condition is architecture- and
vendor-independent (the same abort has been reported on other backends when the
draft lands on a different device than the target).

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: an AI coding assistant helped investigate and reproduce this on real hardware; I authored and reviewed the change and this description, and I can explain and take responsibility for every line.

Ref #24492

With MTP speculative decoding, the draft context can share the target
model's KV cache (the draft KV layers reuse the target's pre-allocated K/V
tensors). Those shared tensors live on the target model's buffer. On a
multi-backend build (for example HIP + Vulkan), the draft model may be
auto-assigned to a different backend than the target, so the shared tensor
cannot be scheduled on the draft's backend. ggml then aborts deep inside the
scheduler with an opaque message and no hint about the cause:

  ggml-backend.cpp:898: pre-allocated tensor (cache_k_lNN) in a buffer (X)
  that cannot run the operation (NONE)

Detect this cross-backend sharing at KV-cache construction and fail with an
actionable std::runtime_error that names the mismatched backends and points
to the --spec-draft-device workaround, instead of aborting the process.

The clean single-backend / pinned-device path is unchanged.

Ref: ggml-org#24492

Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>