
I got hold of 2 x ASUS Ascent GX10 and tested vLLM and Ollama #8691

@CG-8663


Real-world vLLM benchmarks on ASUS Ascent GX10 — Triton kernels vs GGUF on single node

I ran a head-to-head benchmark of vLLM and Ollama on a single ASUS Ascent GX10 node (GB10 superchip: Blackwell GPU + Grace CPU), using the same model (Llama 3.3 70B Instruct), the same hardware, and isolated runs. The Triton-compiled kernels in vLLM's stack made a measurable difference at every level. I was looking for the sweet spot for my cluster build; I started on Sunday and finished at 3:23am on Wednesday morning, Singapore time.

Hardware — Single ASUS Ascent GX10 Node (GB10 Superchip)

  • Blackwell GPU + Grace CPU on a single superchip
  • 128GB LPDDR5X unified memory (CPU+GPU coherent via NVLink-C2C)
  • 20 Arm cores (10 Cortex-X925 + 10 Cortex-A725)
  • 4TB Gen 5 NVMe

Results

Metric                      vLLM (AWQ Marlin)    Ollama (Q4_K_M GGUF)    Delta
Prompt processing           5,847 tok/s          1,268 tok/s             4.6x
Time to first token         243ms                1,412ms                 5.8x
Generation throughput       82 tok/s             47 tok/s                1.7x
Concurrent requests (24)    All parallel         Sequential queue        24x
p99 latency under load      312ms                34,800ms                112x

Why it matters for Triton

The prompt processing gap (4.6x) comes down to how the compute hits the GPU. vLLM's AWQ Marlin kernels are fused INT4 dequant+GEMM operations that keep the tensor cores saturated — they're compiled through Triton's JIT pipeline and benefit directly from:

  • Fused kernel generation — dequantisation and matmul in a single kernel launch, eliminating intermediate memory round-trips (a toy sketch of this fusion follows the list)
  • PagedAttention — non-contiguous KV cache blocks managed like virtual memory pages, enabling 24 concurrent sequences without OOM
  • CUDA graph capture — Triton-compiled kernels get captured into static graphs, removing per-token launch overhead during generation
  • Continuous batching — new requests slot into running batches mid-generation rather than waiting for the full batch to complete
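To make the fused-kernel point concrete, here is a toy Triton kernel that dequantises packed INT4 weights and runs the matmul in the same launch, so the fp16 weights never make a round trip through device memory between the two steps. This is not vLLM's Marlin kernel; the nibble packing layout, the fixed zero-point of 8, and the single per-column scale are simplifying assumptions for illustration only.

import torch
import triton
import triton.language as tl

@triton.jit
def fused_int4_dequant_gemm(
    a_ptr, qw_ptr, scale_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_wk, stride_wn,   # qw packs 8 INT4 values per int32 word along K: shape (K // 8, N)
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # one fp16 scale per output column (a simplification of group-wise scales)
    scale = tl.load(scale_ptr + offs_n, mask=offs_n < N, other=1.0)
    for k0 in range(0, K, BLOCK_K):
        k = k0 + offs_k
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + k[None, :] * stride_ak,
                    mask=(offs_m[:, None] < M) & (k[None, :] < K), other=0.0)
        # load the int32 words holding the nibbles, then unpack and dequantise in registers
        packed = tl.load(qw_ptr + (k[:, None] // 8) * stride_wk + offs_n[None, :] * stride_wn,
                         mask=(k[:, None] < K) & (offs_n[None, :] < N), other=0)
        shift = (k[:, None] % 8) * 4
        w_int4 = (packed >> shift) & 0xF
        w = (w_int4.to(tl.float32) - 8.0) * scale[None, :]
        # matmul on the freshly dequantised tile: no intermediate fp16 weight tensor in HBM
        acc += tl.dot(a, w.to(tl.float16))
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16),
             mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

def int4_gemm(a, packed_w, scale):
    # a: (M, K) fp16, packed_w: (K // 8, N) int32, scale: (N,) fp16
    M, K = a.shape
    N = packed_w.shape[1]
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    fused_int4_dequant_gemm[grid](
        a, packed_w, scale, c, M, N, K,
        a.stride(0), a.stride(1),
        packed_w.stride(0), packed_w.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return c

The unfused alternative would first materialise the full fp16 weight matrix with one kernel and then run a separate GEMM over it, paying an extra trip through device memory per layer; avoiding that trip is what the "eliminating intermediate memory round-trips" bullet is getting at.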

Ollama's GGUF path uses llama.cpp with Q4_K_M quantisation. It's optimised for single-user interactive use — great for that, but it processes requests sequentially and doesn't exploit tensor core INT4 throughput the way Marlin does.

The concurrent load story

This is where the architecture gap becomes stark. At 24 simultaneous requests:

  • vLLM: 312ms p99 TTFT, all requests processed in parallel via continuous batching
  • Ollama: 34.8s p99 TTFT, requests queued sequentially — request 24 waits for 1-23 to complete

PagedAttention lets vLLM dynamically allocate KV cache blocks per-sequence without pre-reserving contiguous memory. This is what makes 24-way concurrency feasible on a single GPU without fragmentation.
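A rough Python sketch of that block-table bookkeeping (not vLLM's implementation; the block size of 16 tokens and the pool size are arbitrary illustrative values): each sequence maps its logical KV positions to whichever physical blocks happen to be free, so nothing has to be contiguous or reserved up front.

BLOCK_TOKENS = 16  # tokens per KV cache block (illustrative)

class BlockPool:
    """A shared pool of fixed-size physical KV cache blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise RuntimeError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)

class Sequence:
    """One running request; its block table maps logical blocks to physical ones."""
    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # grab a new physical block only when the current one fills up
        if self.num_tokens % BLOCK_TOKENS == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def finish(self) -> None:
        self.pool.release(self.block_table)
        self.block_table.clear()

# 24 sequences share one pool and claim memory block by block as they generate,
# which is why 24-way concurrency doesn't need 24 contiguous, pre-reserved regions.
pool = BlockPool(num_blocks=4096)
sequences = [Sequence(pool) for _ in range(24)]
for _ in range(128):
    for seq in sequences:
        seq.append_token()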

Reproduction

# vLLM
vllm serve meta-llama/Llama-3.3-70B-Instruct-AWQ \
  --quantization awq_marlin \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92

# Ollama
ollama run llama3.3:70b-instruct-q4_K_M

Benchmarked with a custom async Python harness hitting the /v1/chat/completions endpoint at 1, 8, 16, and 24 concurrent connections.
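For reference, here is a minimal sketch of what such a harness can look like. The author's actual script isn't shown; the endpoint URL, model name, prompt, and the httpx dependency are assumptions. It streams the response so it can record time to first token per request, then reports the p99 across N concurrent calls.

import asyncio
import time
import httpx

URL = "http://localhost:8000/v1/chat/completions"  # vLLM's OpenAI-compatible endpoint (port assumed)
MODEL = "meta-llama/Llama-3.3-70B-Instruct-AWQ"

async def one_request(client: httpx.AsyncClient, prompt: str) -> float:
    payload = {
        "model": MODEL,
        "stream": True,
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    }
    start = time.perf_counter()
    ttft = None
    async with client.stream("POST", URL, json=payload, timeout=300.0) as resp:
        async for line in resp.aiter_lines():
            # the first SSE data line marks the first generated token
            if ttft is None and line.startswith("data: ") and line != "data: [DONE]":
                ttft = time.perf_counter() - start
    return ttft

async def run(concurrency: int) -> None:
    async with httpx.AsyncClient() as client:
        tasks = [one_request(client, f"Summarise benchmark request {i}") for i in range(concurrency)]
        ttfts = sorted(t for t in await asyncio.gather(*tasks) if t is not None)
    p99 = ttfts[min(len(ttfts) - 1, int(0.99 * len(ttfts)))]
    print(f"{concurrency:>3} concurrent: p99 TTFT {p99 * 1000:.0f} ms")

if __name__ == "__main__":
    for n in (1, 8, 16, 24):
        asyncio.run(run(n))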

Full writeup with architecture diagrams and Mac cluster (Exo distributed inference) comparison: https://chronara.io/news/vllm-vs-ollama-benchmark

I hope it helps someone.
