Real-world vLLM benchmarks on ASUS Ascent GX10 — Triton kernels vs GGUF on single node
I ran a head-to-head benchmark of vLLM and Ollama on a single ASUS Ascent GX10 (GB10 superchip — Blackwell GPU + Grace CPU) node — same model (Llama 3.3 70B Instruct), same hardware, isolated runs. The Triton-compiled kernels in vLLM's stack made a measurable difference at every level. I was looking for the sweet spot for my cluster build, which I started on Sunday and finished at 3:23am Wednesday morning, Singapore time.
Hardware — Single ASUS Ascent GX10 Node (GB10 Superchip)
- Blackwell GPU + Grace CPU on a single superchip
- 128GB LPDDR5X unified memory (CPU+GPU coherent via NVLink-C2C)
- 20 Arm cores (10× Cortex-X925 + 10× Cortex-A725)
- 4TB Gen 5 NVMe
Results
| Metric | vLLM (AWQ Marlin) | Ollama (Q4_K_M GGUF) | Delta |
|---|---|---|---|
| Prompt processing | 5,847 tok/s | 1,268 tok/s | 4.6x |
| Time to first token | 243ms | 1,412ms | 5.8x |
| Generation throughput | 82 tok/s | 47 tok/s | 1.7x |
| Concurrent requests (24) | All parallel | Sequential queue | 24x |
| p99 latency under load | 312ms | 34,800ms | 112x |
Why it matters for Triton
The prompt processing gap (4.6x) comes down to how the compute hits the GPU. vLLM's AWQ Marlin kernels are fused INT4 dequant+GEMM operations that keep the tensor cores saturated — they're compiled through Triton's JIT pipeline and benefit directly from:
- Fused kernel generation — dequantisation and matmul in a single kernel launch, eliminating intermediate memory round-trips (see the toy sketch after this list)
- PagedAttention — non-contiguous KV cache blocks managed like virtual memory pages, enabling 24 concurrent sequences without OOM
- CUDA graph capture — Triton-compiled kernels get captured into static graphs, removing per-token launch overhead during generation
- Continuous batching — new requests slot into running batches mid-generation rather than waiting for the full batch to complete
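To make the fusion point concrete, here is a toy Triton kernel in the same spirit: INT8 weights are dequantised in registers inside the same kernel that performs the GEMM, so the quantised weights never make a round trip through HBM as fp16. This is a minimal sketch, not code from vLLM; Marlin's real kernels handle packed INT4, per-group scales and far more aggressive scheduling. Shapes are assumed to be multiples of the block size so the sketch can stay mask-free.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def int8_dequant_matmul_kernel(
    a_ptr, bq_ptr, scale_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # One program computes one BLOCK_M x BLOCK_N tile of C = A @ dequant(Bq).
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = bq_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    scale = tl.load(scale_ptr + offs_n)          # per-output-channel fp16 scale

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)                      # fp16 activations
        bq = tl.load(b_ptrs)                     # int8 weights straight from HBM
        b = bq.to(tl.float16)                    # dequantise in registers, no extra memory pass
        acc = tl.dot(a, b, acc)                  # tensor-core matmul on the dequantised tile
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    acc = acc * scale.to(tl.float32)[None, :]    # apply scales once, after accumulation
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16))


def int8_dequant_matmul(a, bq, scale, block=64):
    # Toy launcher; assumes M, N, K are multiples of `block` so the kernel stays mask-free.
    M, K = a.shape
    _, N = bq.shape
    assert M % block == 0 and N % block == 0 and K % block == 0
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, block), triton.cdiv(N, block))
    int8_dequant_matmul_kernel[grid](
        a, bq, scale, c, M, N, K,
        a.stride(0), a.stride(1),
        bq.stride(0), bq.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=block, BLOCK_N=block, BLOCK_K=block,
    )
    return c
```

Because the dequantisation is fused, the only global-memory traffic for the weights is the INT8 load itself, which is the property that keeps prompt processing compute-bound rather than bandwidth-bound.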
Ollama's GGUF path uses llama.cpp with Q4_K_M quantisation. It's optimised for single-user interactive use — great for that, but it processes requests sequentially and doesn't exploit tensor core INT4 throughput the way Marlin does.
The concurrent load story
This is where the architecture gap becomes stark. At 24 simultaneous requests:
- vLLM: 312ms p99 TTFT, all requests processed in parallel via continuous batching
- Ollama: 34.8s p99 TTFT, requests queued sequentially — request 24 waits for 1-23 to complete
PagedAttention lets vLLM dynamically allocate KV cache blocks per-sequence without pre-reserving contiguous memory. This is what makes 24-way concurrency feasible on a single GPU without fragmentation.
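As an illustration of the block-table idea (not vLLM's actual BlockManager, which also does prefix sharing, copy-on-write and swapping), a paged KV cache allocator can be sketched in a few lines: each sequence owns a list of fixed-size physical blocks handed out from a free list, so its cache never needs to be contiguous, and blocks freed by finished sequences are immediately reusable. The class name and block size below are made up for illustration.

```python
class PagedKVCache:
    """Toy allocator: fixed-size KV blocks handed out on demand, like virtual memory pages."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                  # tokens stored per physical block
        self.free_blocks = list(range(num_blocks))    # physical block ids not yet in use
        self.block_tables: dict[str, list[int]] = {}  # seq id -> its (non-contiguous) block ids

    def append_token(self, seq_id: str, num_tokens_so_far: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate a new physical block only when the sequence crosses a block boundary.
        if num_tokens_so_far % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; vLLM would preempt a sequence here")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        # A finished sequence returns its blocks for immediate reuse by new requests.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4096, block_size=16)
    for step in range(64):                # 24 sequences, each growing by one token per step
        for seq in range(24):
            cache.append_token(f"req-{seq}", step)
    used = sum(len(t) for t in cache.block_tables.values())
    print(f"24 sequences x 64 tokens -> {used} blocks in use, {len(cache.free_blocks)} free")
```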
Reproduction
```bash
# vLLM
vllm serve meta-llama/Llama-3.3-70B-Instruct-AWQ \
  --quantization awq_marlin \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92

# Ollama
ollama run llama3.3:70b-instruct-q4_K_M
```
Benchmarked with a custom async Python harness hitting the /v1/chat/completions endpoint at 1, 8, 16, and 24 concurrent connections.
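For anyone who wants to reproduce the load test without writing a client from scratch, here is a minimal sketch of that style of harness, assuming the vLLM server from the commands above is listening on localhost:8000 with the OpenAI-compatible API (port, prompt, timeout and token budget are assumptions, not my actual harness). Note it measures whole-request latency; measuring TTFT properly requires streaming responses and timing the first chunk.

```python
import asyncio
import time

import httpx  # any async HTTP client works; httpx is used here for brevity

URL = "http://localhost:8000/v1/chat/completions"    # assumed local vLLM endpoint
MODEL = "meta-llama/Llama-3.3-70B-Instruct-AWQ"      # matches the serve command above


async def one_request(client: httpx.AsyncClient, prompt: str) -> float:
    """Fire one non-streaming chat completion and return its wall-clock latency in seconds."""
    t0 = time.perf_counter()
    resp = await client.post(
        URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
        },
        timeout=300.0,
    )
    resp.raise_for_status()
    return time.perf_counter() - t0


async def run(concurrency: int) -> None:
    async with httpx.AsyncClient() as client:
        latencies = sorted(
            await asyncio.gather(
                *(one_request(client, f"Summarise request {i} in one sentence.")
                  for i in range(concurrency))
            )
        )
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
    print(f"{concurrency:>2} concurrent -> p99 {p99 * 1000:.0f} ms")


if __name__ == "__main__":
    for n in (1, 8, 16, 24):
        asyncio.run(run(n))
```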
Full writeup with architecture diagrams and Mac cluster (Exo distributed inference) comparison: https://chronara.io/news/vllm-vs-ollama-benchmark
I hope it helps someone