Qwen 27B reaches 65 tokens/sec on single RTX 3090 with Q5_K_S quantization

A practitioner running Qwen 27B on a single RTX 3090 reported 65 tokens per second using Q5_K_S quantization and llama.cpp's draft-MTP speculative decoding, balancing throughput against reasoning fidelity on a 24GB card.

May 16, 2026

Qwen 27B running on a single RTX 3090 reaches 65 tokens per second with Q5_K_S quantization and llama.cpp's draft-MTP speculative decoding. The config offloads all layers to VRAM with -ngl -1, quantizes both the KV-cache keys and values to Q8_0, and caps speculative tokens at 2 per step. At a 65,536-token context window, the setup requires frequent cache compaction. The practitioner accepted that tradeoff to avoid dropping to Q4 quantization, which would save memory but degrade reasoning accuracy.
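To see why a cap of 2 draft tokens can still pay off, here is a minimal sketch of the standard expected-yield estimate for speculative decoding. It assumes each drafted token is accepted independently with a fixed probability, which is a simplification. The acceptance rates below are hypothetical, since the report does not state one.

```python
def expected_tokens_per_step(accept_prob: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Simplifying assumption: each drafted token is accepted independently
    with probability accept_prob. The main model always contributes one
    token per pass, so the result lies between 1 and draft_tokens + 1.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    if accept_prob == 1.0:
        return float(draft_tokens + 1)
    # Geometric series: 1 + p + p^2 + ... + p^k
    return (1.0 - accept_prob ** (draft_tokens + 1)) / (1.0 - accept_prob)


if __name__ == "__main__":
    # Hypothetical acceptance rates; 2 draft tokens matches the reported cap.
    for p in (0.5, 0.7, 0.9):
        print(f"accept={p:.1f}: {expected_tokens_per_step(p, 2):.2f} tokens/step")
```

Under this model the yield per pass can never exceed 3 tokens with a cap of 2, so the real speedup depends on how often the drafts are accepted and on the cost of producing them.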

The multimodal projector stays on CPU with --no-mmproj-offload, and 8 CPU threads handle the work that remains on the CPU. The --chat-template-kwargs "{\"preserve_thinking\": true}" parameter preserves the model's chain-of-thought reasoning in outputs. The --fit off flag disables automatic context fitting, forcing manual management of the window. Community guides for 3090 owners typically recommend Q4 quants as a baseline for 27B-class models, but Q5_K_S preserves more of the original weights' precision, which matters for long-context tasks where the model must track state across tens of thousands of tokens. The cost is tighter memory margins and compaction overhead.
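The options described above could be assembled roughly as follows. This is a sketch, not the practitioner's actual command. It assumes the llama-server binary and a hypothetical model filename, and it maps the context size, thread count, and KV-cache types to the usual llama.cpp options (-c, -t, --cache-type-k, --cache-type-v). The draft-MTP options are left out because the report does not give their exact spelling.

```python
import json
import shlex


def build_llama_server_args(model_path: str) -> list[str]:
    """Assemble an argument list from the settings named in the report.

    Assumptions: the 'llama-server' binary name and model_path are
    illustrative. Speculative-decoding (draft-MTP) flags are omitted on
    purpose because their exact names are not given in the source.
    """
    template_kwargs = json.dumps({"preserve_thinking": True})
    return [
        "llama-server",                 # assumed binary name
        "-m", model_path,               # hypothetical GGUF path
        "-ngl", "-1",                   # layer offload setting, as reported
        "-c", "65536",                  # 65,536-token context window
        "--cache-type-k", "q8_0",       # KV-cache keys at Q8_0
        "--cache-type-v", "q8_0",       # KV-cache values at Q8_0
        "-t", "8",                      # 8 CPU threads
        "--no-mmproj-offload",          # keep the multimodal projector on CPU
        "--chat-template-kwargs", template_kwargs,
        "--fit", "off",                 # disable automatic fitting, as reported
    ]


if __name__ == "__main__":
    # Print a shell-safe command line for inspection; nothing is executed.
    print(shlex.join(build_llama_server_args("qwen-27b-q5_k_s.gguf")))
```

Building the list in code and quoting it with shlex.join avoids hand-escaping the JSON passed to --chat-template-kwargs.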

Qwen 27B is part of Alibaba's open-weight Qwen 3.6 series, which supports native multimodal input and million-token context in full precision. The 24GB VRAM ceiling on a 3090 makes quantization mandatory for models above 20B parameters: at 16 bits per weight, 27 billion parameters take roughly 54 GB for the weights alone. At 65 tokens per second, the config is fast enough for interactive use, though compaction pauses remain a friction point when the context window fills.
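The memory pressure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates weight storage only, using the nominal bits per weight of each GGUF block format and a round 27 billion parameters. Both are assumptions: real quantized files mix tensor types, so actual sizes differ, and the KV cache and compute buffers come on top.

```python
GIB = 2**30

# Nominal bits per weight of the underlying block formats (assumption:
# real GGUF files mix several tensor types, so file sizes will differ).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K": 5.5,
    "Q4_K": 4.5,
}


def weight_gib(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, excluding KV cache and buffers."""
    return params * bits_per_weight / 8 / GIB


if __name__ == "__main__":
    params = 27e9        # nominal parameter count (assumption)
    vram_gib = 24.0      # RTX 3090
    for name, bpw in BITS_PER_WEIGHT.items():
        size = weight_gib(params, bpw)
        verdict = "fits" if size < vram_gib else "does not fit"
        print(f"{name:>5}: {size:5.1f} GiB weights ({verdict} in {vram_gib:.0f} GiB)")
```

Whatever is left after the weights has to hold the 65,536-token KV cache, which is consistent with the tight margins and frequent compaction described above.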
