
I got hold of 2 x ASUS Ascent GX10 and tested vLLM and Ollama #8691

@CG-8663


Real-world vLLM benchmarks on ASUS Ascent GX10 — Triton kernels vs GGUF on single node

I ran a head-to-head benchmark of vLLM and Ollama on a single ASUS Ascent GX10 node (GB10 superchip: Blackwell GPU + Grace CPU), using the same model (Llama 3.3 70B Instruct), the same hardware, and isolated runs. The Triton-compiled kernels in vLLM's stack made a measurable difference at every level. I was looking for the sweet spot for my cluster build; I started on Sunday and finished at 3:23am on Wednesday morning, Singapore time.

Hardware — Single ASUS Ascent GX10 Node (GB10 Superchip)

  • Blackwell GPU + Grace CPU on a single superchip
  • 128GB LPDDR5X unified memory (CPU+GPU coherent via NVLink-C2C)
  • 20 Arm cores (10 Cortex-X925 + 10 Cortex-A725)
  • 4TB Gen 5 NVMe

Results

Metric                      vLLM (AWQ Marlin)    Ollama (Q4_K_M GGUF)    Delta
Prompt processing           5,847 tok/s          1,268 tok/s             4.6x
Time to first token         243ms                1,412ms                 5.8x
Generation throughput       82 tok/s             47 tok/s                1.7x
Concurrent requests (24)    All parallel         Sequential queue        24x
p99 latency under load      312ms                34,800ms                112x

Why it matters for Triton

The prompt processing gap (4.6x) comes down to how the compute hits the GPU. vLLM's AWQ Marlin kernels are fused INT4 dequant+GEMM operations that keep the tensor cores saturated — they're compiled through Triton's JIT pipeline and benefit directly from:

  • Fused kernel generation — dequantisation and matmul in a single kernel launch, eliminating intermediate memory round-trips (a toy sketch of this fusion follows the list)
  • PagedAttention — non-contiguous KV cache blocks managed like virtual memory pages, enabling 24 concurrent sequences without OOM
  • CUDA graph capture — Triton-compiled kernels get captured into static graphs, removing per-token launch overhead during generation
  • Continuous batching — new requests slot into running batches mid-generation rather than waiting for the full batch to complete
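To make the fused-kernel point concrete, here is a toy Triton kernel that dequantises packed INT4 weights and runs the matmul in the same launch, so the fp16 weights never make a round trip through device memory between the two steps. This is not vLLM's Marlin kernel; the nibble packing layout, the fixed zero-point of 8, and the single per-column scale are simplifying assumptions for illustration only.

import torch
import triton
import triton.language as tl

@triton.jit
def fused_int4_dequant_gemm(
    a_ptr, qw_ptr, scale_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_wk, stride_wn,   # qw packs 8 INT4 values per int32 word along K: shape (K // 8, N)
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # one fp16 scale per output column (a simplification of group-wise scales)
    scale = tl.load(scale_ptr + offs_n, mask=offs_n < N, other=1.0)
    for k0 in range(0, K, BLOCK_K):
        k = k0 + offs_k
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + k[None, :] * stride_ak,
                    mask=(offs_m[:, None] < M) & (k[None, :] < K), other=0.0)
        # load the int32 words holding the nibbles, then unpack and dequantise in registers
        packed = tl.load(qw_ptr + (k[:, None] // 8) * stride_wk + offs_n[None, :] * stride_wn,
                         mask=(k[:, None] < K) & (offs_n[None, :] < N), other=0)
        shift = (k[:, None] % 8) * 4
        w_int4 = (packed >> shift) & 0xF
        w = (w_int4.to(tl.float32) - 8.0) * scale[None, :]
        # matmul on the freshly dequantised tile: no intermediate fp16 weight tensor in HBM
        acc += tl.dot(a, w.to(tl.float16))
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16),
             mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

def int4_gemm(a, packed_w, scale):
    # a: (M, K) fp16, packed_w: (K // 8, N) int32, scale: (N,) fp16
    M, K = a.shape
    N = packed_w.shape[1]
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    fused_int4_dequant_gemm[grid](
        a, packed_w, scale, c, M, N, K,
        a.stride(0), a.stride(1),
        packed_w.stride(0), packed_w.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return c

The unfused alternative would first materialise the full fp16 weight matrix with one kernel and then run a separate GEMM over it, paying an extra trip through device memory per layer; avoiding that trip is what the "eliminating intermediate memory round-trips" bullet is getting at.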

Ollama's GGUF path uses llama.cpp with Q4_K_M quantisation. It's optimised for single-user interactive use — great for that, but it processes requests sequentially and doesn't exploit tensor core INT4 throughput the way Marlin does.

The concurrent load story

This is where the architecture gap becomes stark. At 24 simultaneous requests:

  • vLLM: 312ms p99 TTFT, all requests processed in parallel via continuous batching
  • Ollama: 34.8s p99 TTFT, requests queued sequentially — request 24 waits for 1-23 to complete

PagedAttention lets vLLM dynamically allocate KV cache blocks per-sequence without pre-reserving contiguous memory. This is what makes 24-way concurrency feasible on a single GPU without fragmentation.
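A rough Python sketch of that block-table bookkeeping (not vLLM's implementation; the block size of 16 tokens and the pool size are arbitrary illustrative values): each sequence maps its logical KV positions to whichever physical blocks happen to be free, so nothing has to be contiguous or reserved up front.

BLOCK_TOKENS = 16  # tokens per KV cache block (illustrative)

class BlockPool:
    """A shared pool of fixed-size physical KV cache blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise RuntimeError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)

class Sequence:
    """One running request; its block table maps logical blocks to physical ones."""
    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # grab a new physical block only when the current one fills up
        if self.num_tokens % BLOCK_TOKENS == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def finish(self) -> None:
        self.pool.release(self.block_table)
        self.block_table.clear()

# 24 sequences share one pool and claim memory block by block as they generate,
# which is why 24-way concurrency doesn't need 24 contiguous, pre-reserved regions.
pool = BlockPool(num_blocks=4096)
sequences = [Sequence(pool) for _ in range(24)]
for _ in range(128):
    for seq in sequences:
        seq.append_token()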

Reproduction

# vLLM
vllm serve meta-llama/Llama-3.3-70B-Instruct-AWQ \
  --quantization awq_marlin \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92

# Ollama
ollama run llama3.3:70b-instruct-q4_K_M

Benchmarked with a custom async Python harness hitting the /v1/chat/completions endpoint at 1, 8, 16, and 24 concurrent connections.
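For reference, here is a minimal sketch of what such a harness can look like. The author's actual script isn't shown; the endpoint URL, model name, prompt, and the httpx dependency are assumptions. It streams the response so it can record time to first token per request, then reports the p99 across N concurrent calls.

import asyncio
import time
import httpx

URL = "http://localhost:8000/v1/chat/completions"  # vLLM's OpenAI-compatible endpoint (port assumed)
MODEL = "meta-llama/Llama-3.3-70B-Instruct-AWQ"

async def one_request(client: httpx.AsyncClient, prompt: str) -> float:
    payload = {
        "model": MODEL,
        "stream": True,
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    }
    start = time.perf_counter()
    ttft = None
    async with client.stream("POST", URL, json=payload, timeout=300.0) as resp:
        async for line in resp.aiter_lines():
            # the first SSE data line marks the first generated token
            if ttft is None and line.startswith("data: ") and line != "data: [DONE]":
                ttft = time.perf_counter() - start
    return ttft

async def run(concurrency: int) -> None:
    async with httpx.AsyncClient() as client:
        tasks = [one_request(client, f"Summarise benchmark request {i}") for i in range(concurrency)]
        ttfts = sorted(t for t in await asyncio.gather(*tasks) if t is not None)
    p99 = ttfts[min(len(ttfts) - 1, int(0.99 * len(ttfts)))]
    print(f"{concurrency:>3} concurrent: p99 TTFT {p99 * 1000:.0f} ms")

if __name__ == "__main__":
    for n in (1, 8, 16, 24):
        asyncio.run(run(n))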

Full writeup with architecture diagrams and Mac cluster (Exo distributed inference) comparison: https://chronara.io/news/vllm-vs-ollama-benchmark

I hope it helps someone.
