DeepSeek-V4-Flash runs at 255 tokens/sec on four used RTX 2080 Ti cards
A $2,500 build using four legacy RTX 2080 Ti cards delivers 255 prefill tokens per second on DeepSeek-V4-Flash through custom CUDA kernels, W8A8 quantization, and heterogeneous memory offloading.

DeepSeek-V4-Flash, the 284B-parameter mixture-of-experts model with 13B active parameters, now runs on four RTX 2080 Ti GPUs at 255 prefill tokens per second. The $2,500 build pairs the legacy cards—11GB VRAM each—with an Intel Xeon E5-2696 v4 CPU and 1TB of DDR4 ECC system RAM. Custom CUDA kernels rewritten for Turing architecture accelerate W8A8 INT8 matrix multiplication, sidestepping the PCIe Gen3 and VRAM bandwidth bottlenecks that normally choke consumer hardware on frontier MoE workloads. The implementation splits model weights across GPU VRAM and system RAM, overlapping computation with multi-GPU communication to hide MoE routing overhead.
The code, deployment scripts, and a technical report are open-sourced on GitHub. The author submitted a detailed preprint to arXiv and expects publication once manual review clears. The 2080 Ti cards, released in 2018 at $1,199 each, now sell used for under $400—making this setup accessible to practitioners who already own older hardware.