fix grid for large uncontiguous input by nastya236 · Pull Request #3666 · ml-explore/mlx

nastya236 · 2026-06-12T22:05:28Z

CUDA kernels for non-contiguous tensor operations used a 2D launch grid in which gridDim.y covered all of the remaining dimensions. CUDA limits gridDim.y to at most 65,535, so large non-contiguous arrays caused launch errors in binary, copy, and some unary ops.
For example, for a tensor of shape (32, 8192, 32, 128), i.e. queries of batch size 32 for Qwen 4B:

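import mlx.core as mx

shape = (32, 8192, 32, 128)
x = mx.random.normal(shape=shape)
y = mx.random.normal(shape=shape)

# Swapping axes 1 and 2 makes both arrays non-contiguous.
x = x.transpose(0, 2, 1, 3)
y = y.transpose(0, 2, 1, 3)

# The binary op on non-contiguous inputs triggers the oversized grid.
z = x + y
mx.eval(z)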

-> RuntimeError: cudaGraphAddKernelNode(&node, graph_, NULL, 0, &params) failed: invalid argument.

The same error occurs for mx.exp(x) and in #3659.
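The arithmetic behind the failure can be sketched as follows. This is only an illustration: it assumes the innermost dimension maps to grid x, that every other dimension is flattened into grid y, and that a block holds 32 threads along y. The block size is a made-up value, not one taken from the MLX source.

```python
import math

CUDA_MAX_GRID_Y = 65_535  # CUDA limit for gridDim.y (gridDim.z has the same limit)

# Shape of x and y after transpose(0, 2, 1, 3) in the snippet above.
shape = (32, 32, 8192, 128)

# Everything except the innermost dimension lands on the y axis (assumption).
rest = math.prod(shape[:-1])

# Hypothetical number of threads per block along y.
block_y = 32
blocks_y = math.ceil(rest / block_y)

print(f"rest={rest}, blocks_y={blocks_y}, over_limit={blocks_y > CUDA_MAX_GRID_Y}")
```

Under these assumptions the y axis would need several times more blocks than CUDA allows, which is consistent with the launch being rejected as an invalid argument.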

I split the part of the grid that overflows gridDim.y into gridDim.z.
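One way to spread an oversized block count across gridDim.y and gridDim.z is sketched below, in Python for readability. It is not the code from this PR: `split_grid_yz` and `linear_block_index` are hypothetical names, and the actual kernels may split the grid differently.

```python
import math

MAX_GRID_DIM = 65_535  # CUDA limit for both gridDim.y and gridDim.z


def split_grid_yz(num_blocks: int) -> tuple[int, int]:
    """Split a flat block count into (grid_y, grid_z), each within the limit."""
    if num_blocks <= MAX_GRID_DIM:
        return num_blocks, 1
    grid_z = math.ceil(num_blocks / MAX_GRID_DIM)
    if grid_z > MAX_GRID_DIM:
        raise ValueError("block count does not fit in a y/z grid")
    # Balance y so that grid_y * grid_z just covers num_blocks.
    grid_y = math.ceil(num_blocks / grid_z)
    return grid_y, grid_z


def linear_block_index(block_y: int, block_z: int, grid_y: int) -> int:
    """Flat block index a kernel would rebuild from blockIdx.y and blockIdx.z."""
    return block_z * grid_y + block_y


grid_y, grid_z = split_grid_yz(262_144)
print(grid_y, grid_z, grid_y * grid_z)
```

Because `grid_y * grid_z` can exceed the requested block count by a few blocks, a kernel using this scheme would need to bounds-check the rebuilt index before touching memory.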

zcbenz

Nice fix!

I think it is worth adding a new utility function to compute the dims, like get_launch_args but for non-contiguous arrays.

Also, the test allocates a large buffer that crashes CI and would likely be too slow to run locally. I remember there being a slow test set that runs separately; we should move the test there, though I am not sure it is still a thing. @angeloskath
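If no separate slow test set is available, one generic option is to make the test opt-in. The sketch below uses only the standard `unittest` module. The environment variable `RUN_SLOW_TESTS` and the test names are hypothetical and are not an existing MLX convention.

```python
import os
import unittest

# Hypothetical opt-in switch; the variable name is an assumption.
RUN_SLOW = os.environ.get("RUN_SLOW_TESTS", "0") == "1"


class TestLargeNonContiguous(unittest.TestCase):
    @unittest.skipUnless(RUN_SLOW, "allocates a large buffer; set RUN_SLOW_TESTS=1")
    def test_large_transposed_add(self):
        # Imported lazily so the test can be collected on machines without MLX.
        import mlx.core as mx

        shape = (32, 8192, 32, 128)
        x = mx.random.normal(shape=shape).transpose(0, 2, 1, 3)
        y = mx.random.normal(shape=shape).transpose(0, 2, 1, 3)
        z = x + y
        mx.eval(z)
        self.assertEqual(z.shape, (32, 32, 8192, 128))
```

With the variable unset, the test is reported as skipped and the large allocation never happens. Setting `RUN_SLOW_TESTS=1` runs it.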

fix grid for large uncontiguous input

4bb64bf

zcbenz approved these changes Jun 12, 2026

nastya236 wants to merge 1 commit into main from grid-fix-uncontiguous
