vLLM dominates mixed-GPU long-context prefill; SGLang crashes on Ada, llama.cpp 4–6× slower
A seven-GPU benchmark shows vLLM handles heterogeneous Blackwell-Ada clusters and pipeline parallelism with 4-bit weights far better than SGLang or llama.cpp, with SGLang crashing on Ada cards and llama.cpp suffering severe pipeline bubbles.

"vLLM significantly outperforms competing inference engines on heterogeneous GPU clusters," according to benchmarks posted this week. The tests ran long-context prefill with pipeline parallelism on a seven-GPU rig mixing Blackwell and Ada cards.
The test setup combined one RTX PRO 6000 96GB, one PRO 5000 48GB, two 5090 32GB, and three modded 4090 48GB cards, all running 4-bit weights: NVFP4 for vLLM and SGLang, MXFP4 for llama.cpp. On a mixed two-GPU run of Qwen3.6-35B-A3B at 184k tokens, vLLM hit 18,060 tokens per second (t/s) of prefill throughput and a 10.2-second time to first token; llama.cpp managed only 7,405 t/s and 24.9 seconds. The gap widened on a six-GPU MiniMax-M2.7 test at 82k tokens: vLLM delivered 6,212 t/s and 13.2 seconds, while llama.cpp fell to 1,065 t/s and 77 seconds. SGLang crashed outright because it requires Compute Capability 10.0 for FP4 and lacks a software fallback for older Ada silicon.
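The reported time-to-first-token figures line up with prefill-dominated latency, where time to first token is roughly prompt length divided by prefill throughput. The following Python sketch checks that relationship against the numbers above. It assumes "184k" and "82k" mean 184,000 and 82,000 tokens; the function names are illustrative and not part of any engine's API.

```python
# Rough consistency check: for long prompts, time to first token (TTFT)
# is approximately prompt_tokens / prefill_throughput.
# Assumption: "184k" and "82k" are read as 184,000 and 82,000 tokens.


def estimated_ttft(prompt_tokens: int, prefill_tps: float) -> float:
    """Estimate TTFT in seconds, ignoring scheduling and decode overhead."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of two prefill throughputs."""
    if slow_tps <= 0:
        raise ValueError("slow_tps must be positive")
    return fast_tps / slow_tps


# (label, prompt tokens, reported prefill t/s, reported TTFT in seconds)
REPORTED = [
    ("vLLM, Qwen3.6-35B-A3B, 2 GPUs", 184_000, 18_060, 10.2),
    ("llama.cpp, Qwen3.6-35B-A3B, 2 GPUs", 184_000, 7_405, 24.9),
    ("vLLM, MiniMax-M2.7, 6 GPUs", 82_000, 6_212, 13.2),
    ("llama.cpp, MiniMax-M2.7, 6 GPUs", 82_000, 1_065, 77.0),
]

if __name__ == "__main__":
    for label, tokens, tps, reported in REPORTED:
        est = estimated_ttft(tokens, tps)
        print(f"{label}: estimated {est:.1f} s, reported {reported:.1f} s")
    print(f"2-GPU throughput ratio: {speedup(18_060, 7_405):.2f}x")
    print(f"6-GPU throughput ratio: {speedup(6_212, 1_065):.2f}x")
```

Under these assumptions each estimate lands within about 0.1 seconds of the reported value, and the throughput ratios come out near 2.4× on the two-GPU run and 5.8× on the six-GPU run.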
On a pure four-GPU Blackwell cluster running Qwen3.5-122B-A10B at 75k tokens, SGLang nearly matched vLLM, at 14,177 t/s versus 15,084 t/s, while llama.cpp still lagged at 3,662 t/s.
The performance gap stems from how each engine handles pipeline execution and older hardware. llama.cpp's CPU-side embeddings fragment the execution graph and create pipeline bubbles; SGLang has no FP4 fallback for pre-Blackwell cards. vLLM emulates FP4 on older silicon and supports uneven layer splits via the VLLM_PP_LAYER_PARTITION environment variable, allowing users to balance compute across fast Blackwells and slower 4090s. On the full seven-GPU rig, this tuning pushed a 397B model to 7,683 t/s prefill, a 4–6× speedup over llama.cpp on the same hardware.
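One way to pick an uneven layer split is to hand each pipeline stage a share of layers proportional to its relative speed. The Python sketch below does that and formats the result as a comma-separated list of per-stage layer counts, the form VLLM_PP_LAYER_PARTITION is expected to take. The speed weights, the 48-layer model, and the helper names are hypothetical; the benchmark does not publish the splits it used.

```python
# Sketch: derive an uneven pipeline-parallel layer split from relative GPU speed.
# Assumptions (not from the benchmark): the weights, the 48-layer count, and the
# helper names are illustrative. The output is a comma-separated list of
# per-stage layer counts that sums to the model's total layer count.


def partition_layers(num_layers: int, weights: list[float]) -> list[int]:
    """Split num_layers across stages in proportion to weights."""
    if num_layers < len(weights):
        raise ValueError("need at least one layer per stage")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    total = sum(weights)
    # Reserve one layer per stage, then share out the rest proportionally.
    spare = num_layers - len(weights)
    shares = [spare * w / total for w in weights]
    counts = [1 + int(s) for s in shares]
    # Give leftover layers to the stages with the largest fractional remainder.
    leftover = num_layers - sum(counts)
    order = sorted(
        range(len(weights)),
        key=lambda i: shares[i] - int(shares[i]),
        reverse=True,
    )
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def to_env_value(counts: list[int]) -> str:
    """Format per-stage layer counts as a comma-separated string."""
    return ",".join(str(c) for c in counts)


if __name__ == "__main__":
    # Hypothetical: two fast cards weighted 2.0, two slower cards weighted 1.0.
    split = partition_layers(48, [2.0, 2.0, 1.0, 1.0])
    print(f"VLLM_PP_LAYER_PARTITION={to_env_value(split)}")
```

A speed-proportional split is only a starting point: each stage's layers and KV cache still have to fit in that card's memory, so a 32GB card may need fewer layers than its speed alone would suggest.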