Fix Q4_1 GGUF loading #3664

Open
ricky-chaoju wants to merge 1 commit into ml-explore:main from ricky-chaoju:fix-gguf-q4-1-load

ricky-chaoju commented Jun 12, 2026

Problem

mx.load() on a GGUF file containing Q4_1 tensors returns silently corrupted weights. Scales and biases load correctly and every shape checks out, so nothing fails loudly — the dequantized values are just wrong.

unpack_32_4() in mlx/io/gguf_quants.cpp hard-codes a 2-byte block-header skip (data[j + 2], "to skip scale bytes"). That matches Q4_0's 18-byte block (|f16 d|16B quants|), but Q4_1 blocks are 20 bytes with a 4-byte header (|f16 d|f16 m|16B quants|). extract_q4_1_data() passes the block start, so the two bias bytes get decoded as the first four nibbles and the last two quant bytes of every block are dropped.
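
The off-by-two read can be reproduced on a synthetic block. The sketch below is not MLX code: `read_quant_bytes` and `make_q4_1_block` are hypothetical helpers, and the only facts taken from the description are the 2-byte and 4-byte header sizes and the 16 quant bytes per block.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layout constants taken from the description above (32 weights per block).
constexpr std::size_t kQuantBytes = 16;      // 32 nibbles
constexpr std::size_t kQ4_0HeaderBytes = 2;  // |f16 d|
constexpr std::size_t kQ4_1HeaderBytes = 4;  // |f16 d|f16 m|
static_assert(kQ4_0HeaderBytes + kQuantBytes == 18, "Q4_0 block is 18 bytes");
static_assert(kQ4_1HeaderBytes + kQuantBytes == 20, "Q4_1 block is 20 bytes");

// Hypothetical helper (not an MLX function): return the 16 bytes a reader
// treats as quants when it skips `header_skip` bytes from the block start.
std::array<std::uint8_t, kQuantBytes> read_quant_bytes(
    const std::uint8_t* block, std::size_t header_skip) {
  assert(block != nullptr);
  std::array<std::uint8_t, kQuantBytes> out{};
  for (std::size_t j = 0; j < kQuantBytes; ++j) {
    out[j] = block[j + header_skip];
  }
  return out;
}

// Synthetic Q4_1 block with recognizable bytes:
// scale = D0 D1, bias = B0 B1, quant bytes = 00 01 ... 0F.
std::array<std::uint8_t, 20> make_q4_1_block() {
  std::array<std::uint8_t, 20> block{};
  block[0] = 0xD0;
  block[1] = 0xD1;
  block[2] = 0xB0;
  block[3] = 0xB1;
  for (std::size_t j = 0; j < kQuantBytes; ++j) {
    block[kQ4_1HeaderBytes + j] = static_cast<std::uint8_t>(j);
  }
  return block;
}
```

Calling `read_quant_bytes(block.data(), kQ4_0HeaderBytes)` on this block yields a buffer that starts with the bias bytes `B0 B1` and ends at quant byte `0D`, while a skip of `kQ4_1HeaderBytes` returns `00` through `0F` as intended.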

Fix

unpack_32_4() now takes a pointer to the first quant byte and each caller skips its own header explicitly (data + 2 for Q4_0, data + 4 for Q4_1). No behavior change for Q4_0/Q8_0.
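
The shape of the change can be sketched roughly as follows. This is an illustration under assumptions, not the patched MLX source: `unpack_nibbles`, `unpack_q4_0_block`, and `unpack_q4_1_block` are hypothetical names, and the output is a plain array of 32 unpacked 4-bit values rather than MLX's repacked layout. The nibble order (low nibbles hold weights 0 to 15, high nibbles hold weights 16 to 31) follows the usual GGML Q4 convention.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kQuantBytes = 16;  // 32 four-bit weights per block

// Hypothetical stand-in for the patched unpack routine: `quants` must already
// point at the first quant byte, so no header offset is baked in here.
std::array<std::uint8_t, 32> unpack_nibbles(const std::uint8_t* quants) {
  assert(quants != nullptr);
  std::array<std::uint8_t, 32> out{};
  for (std::size_t j = 0; j < kQuantBytes; ++j) {
    out[j] = quants[j] & 0x0F;               // weights 0..15: low nibbles
    out[j + kQuantBytes] = quants[j] >> 4;   // weights 16..31: high nibbles
  }
  return out;
}

// Each caller skips its own header explicitly.
std::array<std::uint8_t, 32> unpack_q4_0_block(const std::uint8_t* block) {
  return unpack_nibbles(block + 2);  // Q4_0: |f16 d|16B quants|
}

std::array<std::uint8_t, 32> unpack_q4_1_block(const std::uint8_t* block) {
  return unpack_nibbles(block + 4);  // Q4_1: |f16 d|f16 m|16B quants|
}
```

Keeping the header size at the call site means a format with a different header only needs a different offset there, and the shared unpack routine stays format-agnostic.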

Validation

Against the gguf python package's reference dequantize (N(0,1) weights, shape (8, 64)):

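| qtype | max abs err (before) | max abs err (after) |
|-------|----------------------|---------------------|
| Q8_0  | 0.0020               | 0.0020              |
| Q4_0  | 0.0020               | 0.0020              |
| Q4_1  | 5.08                 | 0.0020              |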

Found while building GGUF support for the vLLM Metal plugin (vllm-project/vllm-metal#415).

ricky-chaoju added a commit to vllm-project/vllm-metal that referenced this pull request Jun 13, 2026
The original roadmap had this PR covering Q8_0, Q4_0, and Q4_1. While
implementing it, I found that MLX's GGUF repack path currently
mis-decodes Q4_1 weights.

Specifically, `mx.load()` appears to silently corrupt Q4_1 tensors
because it skips a 2-byte block header, while Q4_1 uses a 4-byte block
header. I checked this against the `gguf` package's `dequantize`
implementation: Q4_1 shows a max absolute error of about `5.0`, while
Q8_0 and Q4_0 are around `0.002`.

I filed the upstream fix here: ml-explore/mlx#3664.

Because of that, this PR ships **Q8_0 + Q4_0 only**. Q4_1 is rejected at
construction time with a clear error pointing to ml-explore/mlx#3664.
Re-enabling Q4_1 should be done as a follow-up once a fixed MLX release
is within our pinned range. We currently pin `mlx==0.31.2`, so that
follow-up will also need the usual MLX bump and JIT-build check.

---------

Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com>

zcbenz (Collaborator) left a comment

Looks good to me, thanks!
