Dual 3090 user questions speculative decoding speedup claims with 40–50 t/s results

A LocalLLaMA user running Qwen3.6-27B on dual 3090s with P2P communication reports 40–50 tokens per second using DFlash and MTP speculative decoding, but questions whether community benchmarks reflect realistic gains.

May 16, 2026

A user running dual NVIDIA RTX 3090s with peer-to-peer (P2P) GPU communication is getting 40–50 tokens per second on Qwen3.6-27B with DFlash and MTP speculative decoding, well short of the 2–3× speedups circulating in community discussions. The setup pairs an AMD 9900X CPU and 32GB of RAM with Ubuntu 24.04, CUDA 13.0, and a forked NVIDIA driver that enables P2P transfers between the GPUs. Before testing either acceleration method, the user checked P2P read capability between the cards with nvidia-smi topo -p2p r.

For DFlash, the user ran the beellama fork with 3090-specific parameters, loading spiritbuun draft weights and an unsloth Q5_K_S target model, and measured roughly 40 t/s. For MTP, the user ran the latest llama.cpp against unsloth's Qwen3.6-27B UD-Q4_K_XL and UD-Q8_K_XL checkpoints with --spec-type draft-mtp --spec-draft-n-max 6 and measured about 50 t/s. Measured against the 40 t/s baseline from standard Qwen3.5-27B inference without speculative decoding, DFlash shows essentially no gain and MTP is only about 1.25× faster. The configuration includes --flash-attn on, quantized KV caches (q8_0 for MTP, turbo4 for DFlash), and context windows up to 245,600 tokens, yet the promised multiples remain unrealized. The post leaves open whether the gap stems from misconfiguration, hardware topology limits, or inflated community benchmarks.
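To put the gap in numbers, here is a minimal Python sketch of the arithmetic, using only the throughput figures reported in the post. The function and variable names are illustrative and do not come from the post or from llama.cpp, and the 2–3× range is the community claim the user questions, not a measured result.

```python
# Throughput figures reported in the post, in tokens per second.
BASELINE_TPS = 40.0                          # no speculative decoding
MEASURED_TPS = {"DFlash": 40.0, "MTP": 50.0}
CLAIMED_SPEEDUPS = (2.0, 3.0)                # community claim, not a measurement


def speedup(measured_tps: float, baseline_tps: float) -> float:
    """Return measured throughput as a multiple of the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return measured_tps / baseline_tps


def required_tps(baseline_tps: float, target_speedup: float) -> float:
    """Return the throughput needed to reach a target speedup over the baseline."""
    return baseline_tps * target_speedup


if __name__ == "__main__":
    for method, tps in MEASURED_TPS.items():
        print(f"{method}: {speedup(tps, BASELINE_TPS):.2f}x over baseline")
    low, high = (required_tps(BASELINE_TPS, s) for s in CLAIMED_SPEEDUPS)
    print(f"A 2-3x speedup would require {low:.0f}-{high:.0f} t/s")
```

On these figures, DFlash works out to 1.00× and MTP to 1.25×, while a 2–3× speedup over the same baseline would mean 80–120 t/s.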
