DeepSeek V4 Pro hits 7.5–8 tokens/sec on mixed CPU-GPU setup with ktransformers
A user benchmarked DeepSeek-V4-Pro on consumer hardware using ktransformers, reaching 7.5–8 tokens/second generation speed across context depths up to 8192 tokens.

DeepSeek-V4-Pro is running locally at nearly 8 tokens per second on a single-node setup combining an AMD Epyc 9374F CPU with an NVIDIA RTX PRO 6000 Max-Q GPU. The user adapted the ktransformers tutorial for DeepSeek V4 Flash, tuning NUMA and core allocation for the hardware, then ran llama-benchy at context depths from zero to 8192 tokens. Generation speed held between 7.3 and 8 t/s across all depths tested, while prompt-processing throughput climbed from 39.8 t/s at depth zero to 45.8 t/s at 4096 tokens.
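To put those throughput numbers in wall-clock terms, tokens divided by tokens-per-second gives the time a stage takes; a minimal sketch using the figures reported above (the `seconds_for` helper is illustrative, not part of any benchmarking tool):

```python
def seconds_for(n_tokens: float, tps: float) -> float:
    """Wall-clock seconds to process n_tokens at a given tokens/sec rate."""
    return n_tokens / tps

# Generating a 512-token reply at the reported ~7.5 t/s generation speed:
gen_time = seconds_for(512, 7.5)
print(f"generation: {gen_time:.1f} s")   # ~68 s

# Prefilling a 4096-token prompt at the reported 45.8 t/s:
prefill_time = seconds_for(4096, 45.8)
print(f"prefill: {prefill_time:.1f} s")  # ~89 s of raw prefill time
```

At 4096 tokens of context, the raw prefill alone accounts for most of the reported 100-plus-second time to first token.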
The ktransformers framework combines sglang with custom kt-kernel optimizations for mixed CPU-GPU inference. Time-to-first-token ranged from 12.9 seconds at depth zero to over 100 seconds at 4096 tokens, reflecting the cost of processing long prefills on this configuration. At depth 2048, prompt processing delivered 45.1 t/s with a 56.7-second TTFT. The RTX PRO 6000 Max-Q, a mobile workstation GPU with 48 GB VRAM, handled the model's GPU layers while the 32-core Epyc CPU managed the offloaded computation. The result demonstrates that ktransformers' kernel optimizations can extract meaningful performance from mixed CPU-GPU setups without the model fitting entirely in GPU memory.
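The reported TTFT figures line up with the prompt-processing rates once a roughly fixed startup cost is separated out; a quick consistency check on the depth-2048 numbers above (variable names are illustrative):

```python
# Figures reported at context depth 2048:
ttft_s = 56.7   # time to first token, seconds
pp_tps = 45.1   # prompt-processing throughput, tokens/sec
depth = 2048    # prompt length in tokens

prefill_s = depth / pp_tps        # time attributable to the prefill itself
overhead_s = ttft_s - prefill_s   # remainder: startup / scheduling cost
print(f"prefill {prefill_s:.1f} s, overhead {overhead_s:.1f} s")
# The ~11 s remainder sits close to the 12.9 s TTFT reported at depth zero,
# suggesting TTFT here is roughly prefill time plus a constant startup cost.
```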