vLLM dominates mixed-GPU long-context prefill; SGLang crashes on Ada, llama.cpp 4–6× slower

A seven-GPU benchmark shows vLLM handling mixed Blackwell-Ada clusters with 4-bit weights and pipeline parallelism far better than SGLang or llama.cpp: SGLang crashes on Ada cards, and llama.cpp suffers severe pipeline bubbles.

May 16, 2026

vLLM significantly outperforms competing inference engines on heterogeneous GPU clusters, according to benchmarks posted this week from a seven-GPU test rig that mixes Blackwell and Ada cards. The tests measured long-context prefill with pipeline parallelism.

The test setup combined one RTX PRO 6000 96GB, one PRO 5000 48GB, two 5090 32GB, and three modded 4090 48GB cards, all running 4-bit weights: NVFP4 for vLLM and SGLang, MXFP4 for llama.cpp. On a mixed two-GPU run of Qwen3.6-35B-A3B at 184k tokens, vLLM hit 18,060 tokens per second (t/s) of prefill throughput and a 10.2-second time to first token (TTFT); llama.cpp managed 7,405 t/s and 24.9 seconds. The gap widened on a six-GPU MiniMax-M2.7 test at 82k tokens: vLLM delivered 6,212 t/s and a 13.2-second TTFT, llama.cpp fell to 1,065 t/s and 77 seconds, and SGLang crashed outright because it requires Compute Capability 10.0 for FP4 and has no software fallback for older Ada silicon.
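
The reported throughput and time-to-first-token figures can be cross-checked with simple arithmetic. The sketch below is an illustration, not part of the original benchmark: it assumes "184k" and "82k" mean exactly 184,000 and 82,000 prompt tokens, and that time to first token is dominated by prefill, so it is roughly prompt tokens divided by prefill throughput. The function names are made up for this example.

```python
# Figures as quoted in the article. Assumption: "184k" and "82k" mean
# exactly 184,000 and 82,000 prompt tokens.
RUNS = {
    "two-GPU Qwen3.6-35B-A3B": {
        "prompt_tokens": 184_000,
        # engine -> (prefill tokens/s, reported time to first token in s)
        "engines": {"vLLM": (18_060, 10.2), "llama.cpp": (7_405, 24.9)},
    },
    "six-GPU MiniMax-M2.7": {
        "prompt_tokens": 82_000,
        "engines": {"vLLM": (6_212, 13.2), "llama.cpp": (1_065, 77.0)},
    },
}


def estimated_ttft(prompt_tokens: int, prefill_tps: float) -> float:
    """Seconds to first token if prefill is the only cost (an assumption)."""
    return prompt_tokens / prefill_tps


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of the first engine's prefill throughput to the second's."""
    return fast_tps / slow_tps


for run_name, run in RUNS.items():
    for engine, (tps, reported_ttft) in run["engines"].items():
        estimate = estimated_ttft(run["prompt_tokens"], tps)
        print(f"{run_name} | {engine}: estimated {estimate:.1f} s, "
              f"reported {reported_ttft} s")
    vllm_tps = run["engines"]["vLLM"][0]
    llama_tps = run["engines"]["llama.cpp"][0]
    print(f"{run_name} | vLLM vs llama.cpp: "
          f"{speedup(vllm_tps, llama_tps):.1f}x")
```

Under those assumptions each estimate lands within about a tenth of a second of the reported figure, and the throughput ratio works out to roughly 2.4× on the two-GPU run and 5.8× on the six-GPU run.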

On a pure four-GPU Blackwell cluster running Qwen3.5-122B-A10B at 75k tokens, SGLang nearly matched vLLM, at 14,177 t/s versus 15,084 t/s, while llama.cpp still lagged at 3,662 t/s.

The gap stems from how each engine handles pipeline execution and older hardware. In llama.cpp, CPU-side embeddings fragment the execution graph and create pipeline bubbles, and SGLang has no FP4 path for pre-Blackwell cards. vLLM emulates FP4 on older silicon and supports uneven layer splits via the VLLM_PP_LAYER_PARTITION environment variable, which lets users balance compute across fast Blackwells and slower 4090s. On the full seven-GPU rig, this tuning pushed a 397B model to 7,683 t/s prefill, a 4–6× speedup over llama.cpp on the same hardware.
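
One way to derive an uneven split is to weight each pipeline stage by its relative speed and give it a proportional share of the layers. The sketch below illustrates that idea only. The layer count and per-GPU weights are hypothetical placeholders, not values from the benchmark, and `partition_layers` is a helper written for this example. It assumes VLLM_PP_LAYER_PARTITION takes a comma-separated list of per-stage layer counts that sum to the model's layer count, which is worth confirming against the vLLM version in use.

```python
def partition_layers(num_layers: int, stage_weights: list[float]) -> list[int]:
    """Split num_layers across pipeline stages in proportion to stage_weights.

    Every stage gets at least one layer and the counts sum to num_layers.
    """
    if num_layers < len(stage_weights):
        raise ValueError("need at least one layer per pipeline stage")
    total = sum(stage_weights)
    ideal = [num_layers * w / total for w in stage_weights]
    counts = [max(1, int(x)) for x in ideal]
    # Hand out leftover layers to the stages furthest below their ideal share.
    while sum(counts) < num_layers:
        i = max(range(len(counts)), key=lambda k: ideal[k] - counts[k])
        counts[i] += 1
    # If the one-layer minimum caused an overshoot, trim the stages that are
    # furthest above their ideal share (never dropping below one layer).
    while sum(counts) > num_layers:
        candidates = [k for k in range(len(counts)) if counts[k] > 1]
        i = max(candidates, key=lambda k: counts[k] - ideal[k])
        counts[i] -= 1
    return counts


# Hypothetical inputs: a 60-layer model on seven pipeline stages, with
# made-up relative speeds (fastest card first). Replace with measured values.
NUM_LAYERS = 60
STAGE_WEIGHTS = [2.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0]

counts = partition_layers(NUM_LAYERS, STAGE_WEIGHTS)
partition = ",".join(str(c) for c in counts)
print(f"export VLLM_PP_LAYER_PARTITION={partition}")
```

A speed-proportional split is only a starting point. Each stage's layers and KV cache also have to fit in that card's memory, so the final partition usually needs adjusting against measured throughput.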
