Fix Q4_1 GGUF loading by ricky-chaoju · Pull Request #3664 · ml-explore/mlx

ricky-chaoju · 2026-06-12T13:43:00Z

Problem

mx.load() on a GGUF file containing Q4_1 tensors returns silently corrupted weights. Scales and biases load correctly and every shape checks out, so nothing fails loudly — the dequantized values are just wrong.

unpack_32_4() in mlx/io/gguf_quants.cpp hard-codes a 2-byte block-header skip (data[j + 2], "to skip scale bytes"). That matches Q4_0's 18-byte block (|f16 d|16B quants|), but Q4_1 blocks are 20 bytes with a 4-byte header (|f16 d|f16 m|16B quants|). extract_q4_1_data() passes the block start, so the two bias bytes get decoded as the first four nibbles and the last two quant bytes of every block are dropped.
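The misalignment can be reproduced outside MLX. This is a minimal sketch in pure Python, not the code in gguf_quants.cpp: `make_q4_1_block` and `read_nibbles` are hypothetical helpers, and the nibble order (byte j holds element j in its low nibble and element j + 16 in its high nibble) is assumed from the ggml Q4 block layout.

```python
import struct

def make_q4_1_block(d, m, quants):
    """Pack one Q4_1 block: f16 scale, f16 min, then 16 bytes of 4-bit quants."""
    assert len(quants) == 32 and all(0 <= q < 16 for q in quants)
    # Assumed layout: byte j = element j (low nibble) | element j + 16 (high nibble).
    packed = bytes(quants[j] | (quants[j + 16] << 4) for j in range(16))
    return struct.pack("<ee", d, m) + packed

def read_nibbles(block, header_bytes):
    """Read 32 nibbles from the 16 bytes that follow a header of the given size."""
    body = block[header_bytes:header_bytes + 16]
    return [b & 0x0F for b in body] + [b >> 4 for b in body]

quants = [i % 16 for i in range(32)]
block = make_q4_1_block(0.5, -1.0, quants)

good = read_nibbles(block, 4)  # Q4_1 header: scale + min
bad = read_nibbles(block, 2)   # Q4_0-sized skip applied to a Q4_1 block
```

With the 2-byte skip, the two bytes of `m` are read as quant bytes, every real quant byte shifts two positions later, and the final two quant bytes are never read.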

Fix

unpack_32_4() now takes a pointer to the first quant byte and each caller skips its own header explicitly (data + 2 for Q4_0, data + 4 for Q4_1). No behavior change for Q4_0/Q8_0.
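The shape of the change, sketched in Python rather than the C++ of the actual patch: the unpack helper no longer knows about headers, and each caller passes only the quant bytes. `extract_q4_0` and `extract_q4_1` are illustrative names, and this sketch only returns the 32 nibble values; any repacking into MLX's own layout is left out.

```python
import struct

def unpack_32_4(quant_bytes):
    """Unpack 16 quant bytes into 32 4-bit values; knows nothing about headers."""
    assert len(quant_bytes) == 16
    return [b & 0x0F for b in quant_bytes] + [b >> 4 for b in quant_bytes]

def extract_q4_0(block):
    # 18-byte block: |f16 d|16B quants|, so the quants start at byte 2.
    (d,) = struct.unpack_from("<e", block, 0)
    return d, unpack_32_4(block[2:18])

def extract_q4_1(block):
    # 20-byte block: |f16 d|f16 m|16B quants|, so the quants start at byte 4.
    d, m = struct.unpack_from("<ee", block, 0)
    return d, m, unpack_32_4(block[4:20])

packed = bytes(range(16))
q4_0_block = struct.pack("<e", 0.25) + packed
q4_1_block = struct.pack("<ee", 0.25, -2.0) + packed

d0, quants0 = extract_q4_0(q4_0_block)
d1, m1, quants1 = extract_q4_1(q4_1_block)
```

Keeping the header offset at the call site means a new block format cannot silently inherit another format's skip.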

Validation

Maximum absolute error against the gguf Python package's reference dequantize (N(0,1) weights, shape (8, 64)):

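| qtype | max abs err (before) | max abs err (after) |
|-------|----------------------|---------------------|
| Q8_0  | 0.0020               | 0.0020              |
| Q4_0  | 0.0020               | 0.0020              |
| Q4_1  | 5.08                 | 0.0020              |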
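The same kind of check can be sketched without MLX or the gguf package. Everything here is an assumption for illustration: `quantize_q4_1` is a simple min/max quantizer (not the gguf package's), and the "candidate" decoder is the reference decoder with a configurable quant offset. It does not reproduce the numbers in the table; it only shows how a max-abs-error comparison against a reference dequantize exposes the wrong header skip.

```python
import random
import struct

def quantize_q4_1(values):
    """Quantize 32 floats into one 20-byte Q4_1 block (simple min/max scheme)."""
    lo, hi = min(values), max(values)
    d = (hi - lo) / 15 or 1.0  # avoid a zero scale for constant blocks
    qs = [min(15, max(0, round((v - lo) / d))) for v in values]
    packed = bytes(qs[j] | (qs[j + 16] << 4) for j in range(16))
    return struct.pack("<ee", d, lo) + packed

def dequantize_q4_1(block, quant_offset):
    """Dequantize one block as x = d * q + m, reading quants at quant_offset."""
    d, m = struct.unpack_from("<ee", block, 0)
    body = block[quant_offset:quant_offset + 16]
    qs = [b & 0x0F for b in body] + [b >> 4 for b in body]
    return [d * q + m for q in qs]

def max_abs_err(blocks, quant_offset):
    """Max abs difference between the reference (offset 4) and a candidate offset."""
    err = 0.0
    for block in blocks:
        ref = dequantize_q4_1(block, 4)
        got = dequantize_q4_1(block, quant_offset)
        err = max(err, max(abs(r - g) for r, g in zip(ref, got)))
    return err

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(8 * 64)]  # N(0,1), shape (8, 64)
blocks = [quantize_q4_1(weights[i:i + 32]) for i in range(0, len(weights), 32)]

err_fixed = max_abs_err(blocks, 4)  # header skip matches Q4_1
err_buggy = max_abs_err(blocks, 2)  # Q4_0-sized skip
```

Shapes, scales, and mins are identical in both runs, which is why only a value-level comparison against a reference catches the bug.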

Found while building GGUF support for the vLLM Metal plugin (vllm-project/vllm-metal#415).

The original roadmap had this PR covering Q8_0, Q4_0, and Q4_1. While implementing it, I found that MLX's GGUF repack path currently mis-decodes Q4_1 weights. Specifically, `mx.load()` appears to silently corrupt Q4_1 tensors because it skips a 2-byte block header, while Q4_1 uses a 4-byte block header. I checked this against the `gguf` package's `dequantize` implementation: Q4_1 shows a max absolute error of about `5.0`, while Q8_0 and Q4_0 are around `0.002`. I filed the upstream fix here: ml-explore/mlx#3664.

Because of that, this PR ships **Q8_0 + Q4_0 only**. Q4_1 is rejected at construction time with a clear error pointing to ml-explore/mlx#3664. Re-enabling Q4_1 should be done as a follow-up once a fixed MLX release is within our pinned range. We currently pin `mlx==0.31.2`, so that follow-up will also need the usual MLX bump and JIT-build check.

Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com>

zcbenz

Looks good to me, thanks!

Fix Q4_1 GGUF loading

440dc36

ricky-chaoju force-pushed the fix-gguf-q4-1-load branch from eb3ecb2 to 440dc36 on June 12, 2026 13:48

ricky-chaoju mentioned this pull request Jun 13, 2026

[GGUF] Add MLX-native Q8_0/Q4_0 runtime boundary vllm-project/vllm-metal#442

Merged

ricky-chaoju mentioned this pull request Jun 13, 2026

Add native GGUF model loading and quantized Metal execution support on Apple Silicon. vllm-project/vllm-metal#415

Open

5 tasks

zcbenz approved these changes Jun 13, 2026

ricky-chaoju wants to merge 1 commit into ml-explore:main from ricky-chaoju:fix-gguf-q4-1-load
