Implement batched matmul for large 1D dot products by Ved235 · Pull Request #3580 · ml-explore/mlx

Ved235 · 2026-05-22T09:55:50Z

Proposed changes

Addresses issue #3533. Adds routing logic in mlx/ops.cpp that splits a large 1D dot product into chunks and runs them as a batched matmul, so that gemv can parallelize across the chunks.
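To make the chunking idea concrete, here is a minimal NumPy sketch of a 1D dot product evaluated as a batch of small matmuls followed by a sum of the partial results. It is not the code from mlx/ops.cpp: the function name chunked_dot, the chunk_size default, and the separate handling of the tail are all assumptions made for illustration.

```python
import numpy as np


def chunked_dot(a, b, chunk_size=65536):
    """Dot product of two 1D arrays, computed as a batch of smaller dot products.

    chunk_size is a hypothetical value, not the one used in the PR.
    """
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("expected two 1D arrays of equal length")

    n = a.shape[0]
    num_chunks = n // chunk_size
    main = num_chunks * chunk_size  # elements covered by full chunks
    total = a.dtype.type(0)

    if num_chunks > 0:
        # (num_chunks, 1, chunk) @ (num_chunks, chunk, 1) -> (num_chunks, 1, 1)
        lhs = a[:main].reshape(num_chunks, 1, chunk_size)
        rhs = b[:main].reshape(num_chunks, chunk_size, 1)
        partials = np.matmul(lhs, rhs)
        total = partials.sum()  # reduce the per-chunk partial sums

    if main < n:
        # Leftover elements that do not fill a whole chunk.
        total = total + np.dot(a[main:], b[main:])

    return total
```

Each batch entry is an independent dot product, which is what gives a backend room to run the chunks in parallel. The result can differ from a single-pass dot product by floating-point rounding, because the summation order changes.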

Benchmark

import mlx.core as mx
import numpy as np
import time

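def bench(fn, rounds=100, label=""):
    # Warm up so one-time costs (kernel compilation, allocation) are not timed.
    for _ in range(3):
        r = fn()
        mx.eval(r)

    times = []
    for _ in range(rounds):
        t0 = time.perf_counter()
        r = fn()
        mx.eval(r)  # MLX is lazy; force the computation before stopping the timer.
        times.append(time.perf_counter() - t0)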

    times.sort()
    median = times[len(times) // 2]
    best = times[0]
    worst = times[-1]
    print(f"{label}")
    print(f"median={median*1000:.3f}ms | min={best*1000:.3f}ms | max={worst*1000:.3f}ms")
    return r

a = mx.random.normal(shape=(50_000_000,), dtype=mx.float32)
b = mx.random.normal(shape=(50_000_000,), dtype=mx.float32)

a_np = np.array(a, copy=False)
b_np = np.array(b, copy=False)

ccc = bench(lambda: mx.inner(a, b), label="MLX native")

print(f"mx.inner : {float(ccc)}")

With this benchmark script, the timings change from:

median=15.393ms | min=15.323ms | max=15.769ms

to

median=1.741ms | min=1.682ms | max=1.835ms

Checklist

Put an x in the boxes that apply.

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
I have updated the necessary documentation (if needed)

Ved235 · 2026-05-23T20:38:16Z

@zcbenz would it be possible for you to review this?

zcbenz

This basically looks good to me, thanks!

Ved235 · 2026-05-26T18:40:38Z

@zcbenz I have made the changes; could you review and merge?

zcbenz

I would like another review from maintainers before merging.

Ved235 · 2026-05-28T19:31:24Z

> I would like another review from maintainers before merging.

Ok sure, is there someone specific I should tag to request a review?

Ved235 · 2026-06-06T18:21:15Z

@angeloskath @jagrit06 it would be great if you could review these changes

Ved235 · 2026-06-12T07:36:27Z

@zcbenz It's been several weeks since the changes were made, and the diff is small in terms of lines changed. Would it be possible to merge this?

zcbenz · 2026-06-12T07:40:01Z

Sorry, WWDC largely disrupted our schedule. There is nothing wrong with this PR; I just need another view on it since I lack the background knowledge. WWDC is over now and there is a large backlog, so please give us more time.

Ved235 · 2026-06-12T07:42:41Z

No worries and thanks for the update.

angeloskath

@Ved235 sorry for the super late reply, especially since it will be a negative response.

This is indeed something we need to fix, but unfortunately this is not the way to fix it. Basically, the ops should not change based on the shape, but the implementation should, the same way that we route to the split-k kernel when the matrix K dimension is large.

I am gonna mark this as requested changes so we don't merge it by accident and make a new PR with a specialized kernel for this particular case. I am not sure how important it is but I think the kernel will end up being simple enough.

Ved235 · 2026-06-13T04:05:11Z

Should I make a PR for this specialised kernel?

Ved235 added 2 commits May 22, 2026 14:49

Route large 1D dot products through batched matmul

fb488bc

Change chunk size

24d639d

zcbenz approved these changes May 24, 2026

Comment thread mlx/ops.cpp Outdated

Ved235 and others added 2 commits May 24, 2026 10:44

Remove operand level checking

0d2e620

Merge branch 'ml-explore:main' into main

3fa3979

zcbenz approved these changes May 27, 2026

zcbenz requested review from angeloskath and jagrit06 May 28, 2026 22:49

angeloskath requested changes Jun 12, 2026

Merge branch 'ml-explore:main' into main

d70a89c

Ved235 wants to merge 5 commits into ml-explore:main from Ved235:main