DeepSeek V4 Pro hits 7.5–8 tokens/sec on mixed CPU-GPU setup with ktransformers
A user benchmarked DeepSeek-V4-Pro on consumer hardware using ktransformers, reaching 7.5–8 tokens/second generation speed across context depths up to 8192 tokens.

DeepSeek-V4-Pro is running locally at nearly 8 tokens per second on a single-node setup combining an AMD Epyc 9374F CPU with an NVIDIA RTX PRO 6000 Max-Q GPU. The user adapted the ktransformers tutorial for DeepSeek V4 Flash, tuning NUMA and core allocation for the hardware, then ran llama-benchy at context depths from zero to 8192 tokens. Generation speed held between 7.3 and 8 t/s across all depths tested, while prompt-processing throughput climbed from 39.8 t/s at depth zero to 45.8 t/s at 4096 tokens.
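To put those throughput numbers in wall-clock terms, tokens divided by tokens-per-second gives the time a stage takes; a minimal sketch using the figures reported above (the `seconds_for` helper is illustrative, not part of any benchmarking tool):

```python
def seconds_for(n_tokens: float, tps: float) -> float:
    """Wall-clock seconds to process n_tokens at a given tokens/sec rate."""
    return n_tokens / tps

# Generating a 512-token reply at the reported ~7.5 t/s generation speed:
gen_time = seconds_for(512, 7.5)
print(f"generation: {gen_time:.1f} s")   # ~68 s

# Prefilling a 4096-token prompt at the reported 45.8 t/s:
prefill_time = seconds_for(4096, 45.8)
print(f"prefill: {prefill_time:.1f} s")  # ~89 s of raw prefill time
```

At 4096 tokens of context, the raw prefill alone accounts for most of the reported 100-plus-second time to first token.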
The ktransformers framework combines sglang with custom kt-kernel optimizations for mixed CPU-GPU inference. Time-to-first-token ranged from 12.9 seconds at depth zero to over 100 seconds at 4096 tokens, reflecting the cost of processing long prefills on this configuration. At depth 2048, prompt processing delivered 45.1 t/s with a 56.7-second TTFT. The RTX PRO 6000 Max-Q, a mobile workstation GPU with 48 GB VRAM, handled the model's GPU layers while the 32-core Epyc CPU managed the offloaded computation. The result demonstrates that ktransformers' kernel optimizations can extract meaningful performance from mixed CPU-GPU setups without the model fitting entirely in GPU memory.
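The reported TTFT figures line up with the prompt-processing rates once a roughly fixed startup cost is separated out; a quick consistency check on the depth-2048 numbers above (variable names are illustrative):

```python
# Figures reported at context depth 2048:
ttft_s = 56.7   # time to first token, seconds
pp_tps = 45.1   # prompt-processing throughput, tokens/sec
depth = 2048    # prompt length in tokens

prefill_s = depth / pp_tps        # time attributable to the prefill itself
overhead_s = ttft_s - prefill_s   # remainder: startup / scheduling cost
print(f"prefill {prefill_s:.1f} s, overhead {overhead_s:.1f} s")
# The ~11 s remainder sits close to the 12.9 s TTFT reported at depth zero,
# suggesting TTFT here is roughly prefill time plus a constant startup cost.
```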