GTX 1080 runs 30B MoE models at 24 tok/s with expert offloading
A practitioner pushed Qwen 3.6 35B-A3B and Gemma 4 26B-A4B to 24+ tokens per second on an 8 GB Pascal card by offloading cold MoE experts to system RAM and fixing llama.cpp's embedding table placement.

A secondhand GTX 1080 with 8 GB VRAM can run 30-billion-parameter mixture-of-experts models at 24 tokens per second with 128k context, according to a practitioner who documented the setup on a $200 machine this week.
The configuration uses llama.cpp's MoE offloading to park inactive expert weights in system RAM and stream them to the GPU over PCIe 3.0 x16 on demand, while active layers and the KV cache stay on-device. TurboQuant and RotorQuant KV cache quantization fit the full 128k context window inside the 8 GB framebuffer; a sample launch command is sketched after the table below. The system, an i7-6700 with 32 GB RAM, is PCIe-bandwidth-limited: GPU utilization hovers around 40-50 percent while the bus saturates.
| Model | tok/s | Key flags |
|---|---|---|
| Qwen 3.6 35B-A3B | ~24 | --n-cpu-moe 30, K=turbo4 V=turbo3 |
| Gemma 4 26B-A4B (no MTP) | ~20 | --n-cpu-moe 20, K=V=turbo3, --flash-attn |
| Gemma 4 26B-A4B + MTP (fixed) | ~24.5 | as above, plus --override-tensor-draft "token_embd\.weight=CUDA0" |
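
A launch command along these lines would reproduce the Qwen row; treat it as a sketch rather than a verified invocation. The binary name, GGUF filename, and --n-gpu-layers value are assumptions, and the turbo4/turbo3 cache-type strings follow the article's TurboQuant naming rather than llama.cpp's stock quant types.

```bash
# Sketch of the Qwen 3.6 35B-A3B setup from the table (assumed llama-server binary,
# placeholder GGUF filename).
#   --n-gpu-layers 99   assumed companion flag: offload all non-expert layers to the GPU
#   --n-cpu-moe 30      keep the MoE expert tensors of the first 30 layers in system RAM
#   --cache-type-k/v    turbo4/turbo3 per the article's naming for the KV cache quants
./llama-server \
  -m ./qwen3.6-35b-a3b.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 30 \
  --ctx-size 131072 \
  --cache-type-k turbo4 \
  --cache-type-v turbo3
```
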
Gemma 4's multi-token prediction speculative decoding delivered only a 5 percent speedup out of the box. The practitioner traced the bottleneck to llama.cpp unconditionally placing the token embedding table on CPU. Because Gemma 4's MTP assistant uses a tied LM head, every draft token triggers a 262k×1024 matmul across PCIe. Forcing the embedding table onto the GPU with `--override-tensor-draft "token_embd\.weight=CUDA0"` lifted the gain to 22 percent and pushed draft acceptance to 79 percent.
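
For completeness, a plausible shape for the fixed Gemma run, assuming its base flags mirror the Gemma row in the table: only the --override-tensor-draft argument is taken verbatim from the write-up, the filename is a placeholder, and whatever switch enables MTP drafting in this build is not shown in the article, so it is omitted here.

```bash
# Sketch of the Gemma 4 26B-A4B + MTP run with the embedding-table fix.
# Only --override-tensor-draft is verbatim from the write-up; the GGUF filename is a
# placeholder and the flag that enables MTP drafting is omitted because the article
# does not show it. The override pins the draft's token embedding table on the GPU so
# the tied-LM-head matmul no longer crosses PCIe for every draft token.
./llama-server \
  -m ./gemma4-26b-a4b.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 20 \
  --ctx-size 131072 \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  --flash-attn \
  --override-tensor-draft "token_embd\.weight=CUDA0"
```
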