Fix Q4_1 GGUF loading #3664
Open
ricky-chaoju wants to merge 1 commit into
eb3ecb2 to 440dc36
ricky-chaoju added a commit to vllm-project/vllm-metal that referenced this pull request on Jun 13, 2026
The original roadmap had this PR covering Q8_0, Q4_0, and Q4_1. While implementing it, I found that MLX's GGUF repack path currently mis-decodes Q4_1 weights. Specifically, `mx.load()` appears to silently corrupt Q4_1 tensors because it skips a 2-byte block header, while Q4_1 uses a 4-byte block header. I checked this against the `gguf` package's `dequantize` implementation: Q4_1 shows a max absolute error of about `5.0`, while Q8_0 and Q4_0 are around `0.002`. I filed the upstream fix here: ml-explore/mlx#3664.

Because of that, this PR ships **Q8_0 + Q4_0 only**. Q4_1 is rejected at construction time with a clear error pointing to ml-explore/mlx#3664. Re-enabling Q4_1 should be done as a follow-up once a fixed MLX release is within our pinned range. We currently pin `mlx==0.31.2`, so that follow-up will also need the usual MLX bump and JIT-build check.

---------

Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com>
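As a rough illustration of what "rejected at construction time with a clear error" could look like, here is a minimal sketch. Every name in it (`SUPPORTED_GGUF_QUANTS`, `UnsupportedGGUFQuantError`, `check_gguf_quant_type`) is hypothetical and chosen for this example; it is not taken from the vllm-metal code.

```python
# Hypothetical sketch: none of these names come from vllm-metal itself.
SUPPORTED_GGUF_QUANTS = frozenset({"Q8_0", "Q4_0"})
MLX_Q4_1_FIX = "ml-explore/mlx#3664"


class UnsupportedGGUFQuantError(ValueError):
    """Raised when a GGUF quantization type cannot be loaded safely."""


def check_gguf_quant_type(quant_type: str) -> str:
    """Return quant_type if it is supported, otherwise raise a clear error."""
    if quant_type == "Q4_1":
        # Fail early instead of loading silently corrupted weights.
        raise UnsupportedGGUFQuantError(
            "Q4_1 GGUF weights are mis-decoded by the pinned MLX release; "
            f"see {MLX_Q4_1_FIX}. Use Q8_0 or Q4_0 until a fixed MLX is pinned."
        )
    if quant_type not in SUPPORTED_GGUF_QUANTS:
        raise UnsupportedGGUFQuantError(
            f"Unsupported GGUF quantization type: {quant_type!r}"
        )
    return quant_type
```

Calling such a check from the loader's constructor would surface the problem before any tensor is read, which seems to be the intent described in the commit message.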
zcbenz (Collaborator) approved these changes on Jun 13, 2026
Looks good to me, thanks!
Problem
`mx.load()` on a GGUF file containing Q4_1 tensors returns silently corrupted weights. Scales and biases load correctly and every shape checks out, so nothing fails loudly; the dequantized values are just wrong.

`unpack_32_4()` in `mlx/io/gguf_quants.cpp` hard-codes a 2-byte block-header skip (`data[j + 2]`, "to skip scale bytes"). That matches Q4_0's 18-byte block (`|f16 d|16B quants|`), but Q4_1 blocks are 20 bytes with a 4-byte header (`|f16 d|f16 m|16B quants|`). `extract_q4_1_data()` passes the block start, so the two bias bytes get decoded as the first four nibbles and the last two quant bytes of every block are dropped.

Fix
`unpack_32_4()` now takes a pointer to the first quant byte and each caller skips its own header explicitly (`data + 2` for Q4_0, `data + 4` for Q4_1). No behavior change for Q4_0/Q8_0.

Validation
Against the `gguf` Python package's reference `dequantize` (N(0,1) weights, shape `(8, 64)`):

Found while building GGUF support for the vLLM Metal plugin (vllm-project/vllm-metal#415).
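The header mismatch described under Problem can be reproduced in isolation with a small, standard-library-only sketch. It packs N(0,1) values into 20-byte Q4_1 blocks (`|f16 d|f16 m|16B quants|`) and decodes them twice: once skipping the 4-byte header, and once skipping only 2 bytes as the Q4_0 path would. The helper names are made up for this example, the quantizer is a simplified min/max one, and the nibble ordering follows the ggml reference layout as I understand it (low nibbles are weights 0..15, high nibbles are weights 16..31) rather than MLX's repacked layout, so treat it as an illustration of the failure mode and not as MLX's code.

```python
import random
import struct

QK = 32  # weights per Q4_1 block


def _to_f16(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q4_1_block(values):
    """Pack 32 floats into one 20-byte block: |f16 d|f16 m|16B quants|."""
    lo, hi = min(values), max(values)
    d = _to_f16((hi - lo) / 15 or 1.0)  # scale; fall back to 1.0 for flat input
    m = _to_f16(lo)                     # bias (block minimum)
    q = [min(15, max(0, round((v - m) / d))) for v in values]
    # Byte j holds weight j in the low nibble and weight j + 16 in the high one.
    qs = bytes(q[j] | (q[j + 16] << 4) for j in range(16))
    return struct.pack("<ee", d, m) + qs


def dequantize_q4_1_block(block, header_bytes):
    """Decode a block, reading 16 quant bytes starting at header_bytes."""
    d, m = struct.unpack_from("<ee", block, 0)
    qs = block[header_bytes:header_bytes + 16]
    low = [(b & 0x0F) * d + m for b in qs]
    high = [(b >> 4) * d + m for b in qs]
    return low + high


def max_abs_error(header_bytes, seed=0, n_blocks=16):
    """Max |original - decoded| over n_blocks random N(0,1) blocks."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_blocks):
        values = [rng.gauss(0.0, 1.0) for _ in range(QK)]
        block = quantize_q4_1_block(values)
        decoded = dequantize_q4_1_block(block, header_bytes)
        worst = max(worst, max(abs(a - b) for a, b in zip(values, decoded)))
    return worst


if __name__ == "__main__":
    print("4-byte header skip:", max_abs_error(4))
    print("2-byte header skip:", max_abs_error(2))
```

With the 4-byte skip the error stays at ordinary 4-bit quantization noise. With the 2-byte skip the bias bytes are read as quants and every remaining quant shifts by two positions, so the error grows to the scale of the weights themselves, while `d`, `m`, and all shapes still look correct.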