fix grid for large uncontiguous input #3666

Open
nastya236 wants to merge 1 commit into main from grid-fix-uncontiguous


@nastya236

Collaborator

CUDA kernels for non-contiguous tensor operations used a 2D launch grid in which gridDim.y covered all the remaining dimensions. CUDA limits gridDim.y to at most 65,535, so the launch fails for large non-contiguous arrays in binary, copy, and some unary ops.
For example, for a tensor of shape (16, 8192, 32, 128), i.e. queries of batch size 16 for Qwen 4B:

import mlx.core as mx

shape = (32, 8192, 32, 128)  
x = mx.random.normal(shape=shape)
y = mx.random.normal(shape=shape)

# Transposing makes both inputs non-contiguous.
x = x.transpose(0, 2, 1, 3)
y = y.transpose(0, 2, 1, 3)

z = x + y
mx.eval(z)

-> RuntimeError: cudaGraphAddKernelNode(&node, graph_, NULL, 0, &params) failed: invalid argument.

The same error occurs for mx.exp(x) and in #3659.

I split the overflow from gridDim.y into gridDim.z.
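
To illustrate the idea, here is a minimal Python sketch of the arithmetic behind such a split. It is not the code from this PR: `split_grid_yz` and `linear_block_index` are hypothetical names, and the sketch only assumes that the kernel recovers a linear block index from blockIdx.y and blockIdx.z and skips blocks past the end.

```python
MAX_GRID_YZ = 65_535  # CUDA limit for both gridDim.y and gridDim.z


def split_grid_yz(num_blocks: int, limit: int = MAX_GRID_YZ) -> tuple[int, int]:
    """Hypothetical helper: spread num_blocks over (gridDim.y, gridDim.z)."""
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    # Smallest z extent that lets y fit under the limit (ceiling division).
    grid_z = -(-num_blocks // limit)
    if grid_z > limit:
        raise ValueError("too many blocks even for a y/z split")
    # Balance y so that grid_y * grid_z >= num_blocks with little waste.
    grid_y = -(-num_blocks // grid_z)
    return grid_y, grid_z


def linear_block_index(block_y: int, block_z: int, grid_y: int) -> int:
    """What a kernel would compute as blockIdx.z * gridDim.y + blockIdx.y."""
    return block_z * grid_y + block_y


# Emulate a launch with a tiny limit to check that every block is covered once.
num_blocks = 10
grid_y, grid_z = split_grid_yz(num_blocks, limit=4)
covered = sorted(
    idx
    for bz in range(grid_z)
    for by in range(grid_y)
    # The kernel needs this bounds check: the grid may be slightly too big.
    if (idx := linear_block_index(by, bz, grid_y)) < num_blocks
)
```

Because grid_y * grid_z can exceed the number of blocks actually needed, the bounds check on the recovered index is what keeps the extra blocks from touching memory.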

@zcbenz left a comment

Collaborator

Nice fix!

I think it is worth adding a new utility function to compute the dims, like get_launch_args but for non-contiguous inputs.

Also, the test allocates a large buffer that crashes CI and would likely be too slow to run locally. I remember there being a slow test set that runs separately; we should move the test there, though I am not sure whether that is still a thing, @angeloskath.
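
For a rough sense of scale, the sketch below estimates the memory needed by the reproduction in the PR description. It assumes float32 elements (4 bytes each) and that the test uses the same shape as that snippet, which this thread does not confirm; `array_nbytes` is a hypothetical helper.

```python
import math


def array_nbytes(shape: tuple[int, ...], itemsize: int = 4) -> int:
    """Bytes needed for a dense array; itemsize=4 assumes float32."""
    return math.prod(shape) * itemsize


shape = (32, 8192, 32, 128)      # shape from the PR description's snippet
per_array = array_nbytes(shape)  # one input array
total = 3 * per_array            # x, y, and the result z

print(f"{per_array / 2**30:.0f} GiB per array, {total / 2**30:.0f} GiB in total")
```

Under those assumptions the three arrays alone need 12 GiB, before counting any contiguous copies the transposed inputs might require.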
