RTX 5070 beats RTX 3090 on sub-12GB models; reasoning models hide 80% of output
A practitioner benchmarked five backends across Strix Halo, RTX 3090, and RTX 5070 in 55 inference runs, finding GDDR7 bandwidth gives the 5070 a decode edge on smaller models—and revealing reasoning models stream most output through a hidden channel invisible to users.

A practitioner published 55 inference runs across three rigs—Strix Halo, RTX 3090, and RTX 5070—to measure which hardware actually wins on models from 0.35B through 35B parameters. The dataset covers five backends (ROCm, Vulkan, CPU, CUDA, vllm-cuda) and four workload shapes: short-prompt chat, long-context RAG, codegen, and agent tasks at concurrency 1 and 4. Every run is logged as YAML, with three measured iterations after warmup, temperature zero, and VRAM-fit verified before each test.
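The measurement protocol described above (warmup, then three measured iterations) can be sketched roughly as follows. This is a minimal illustration, not the practitioner's actual harness: `measure_decode_rate`, the `run_once` callable, and the single warmup pass are all assumptions, since the write-up does not say how many warmup runs were used or how the backends were invoked.

```python
from statistics import mean
from typing import Callable, Dict, Tuple


def measure_decode_rate(
    run_once: Callable[[], Tuple[int, float]],
    warmup: int = 1,       # assumed; the source only says "after warmup"
    iterations: int = 3,   # three measured iterations, per the source
) -> Dict[str, object]:
    """Run a workload several times and report decode throughput.

    `run_once` is a hypothetical callable that performs one inference
    run and returns (generated_tokens, decode_seconds).
    """
    # Warmup runs are executed but their timings are discarded.
    for _ in range(warmup):
        run_once()

    # Measured runs: one tokens-per-second figure per iteration.
    rates = []
    for _ in range(iterations):
        tokens, seconds = run_once()
        rates.append(tokens / seconds)

    # A plain dict like this could be serialized to YAML per run.
    return {
        "iterations": iterations,
        "tok_s_runs": rates,
        "tok_s_mean": mean(rates),
    }
```

Passing the backend call in as a function keeps the timing logic identical across ROCm, Vulkan, CPU, and CUDA runs, which is what makes the per-backend numbers comparable.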
The RTX 5070's 12GB of GDDR7 beats the RTX 3090's 24GB of GDDR6X on every tested model that fits in 12GB, by margins of roughly 5 to 10%. Gemma-3-4b chat hit 156.6 tok/s on the 5070 versus 142.0 on the 3090. Gemma-4-E4B reached 124.3 versus 118.4. LFM2-8B-A1B pushed 336.1 versus 318.7. The author's reading is that, when the model fits, memory bandwidth drives decode speed more than raw VRAM capacity.
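To put the three comparisons on a common scale, the 5070's lead can be expressed as a percentage. The sketch below uses only the chat figures quoted above; the `RESULTS` table and the `lead_pct` helper are names introduced here for illustration.

```python
# Chat decode rates in tok/s, as (RTX 5070, RTX 3090), from the figures above.
RESULTS = {
    "Gemma-3-4b": (156.6, 142.0),
    "Gemma-4-E4B": (124.3, 118.4),
    "LFM2-8B-A1B": (336.1, 318.7),
}


def lead_pct(fast: float, slow: float) -> float:
    """Percentage by which `fast` exceeds `slow`."""
    return (fast / slow - 1.0) * 100.0


leads = {model: lead_pct(r5070, r3090) for model, (r5070, r3090) in RESULTS.items()}

for model, pct in leads.items():
    print(f"{model}: 5070 leads by {pct:.1f}%")
# Gemma-3-4b: 5070 leads by 10.3%
# Gemma-4-E4B: 5070 leads by 5.0%
# LFM2-8B-A1B: 5070 leads by 5.5%
```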
What stands out
- 01. The 3090 dominates the 14-31GB band. Models too large for 12GB but comfortable in 24GB run faster on the 3090 than on Strix. Gemma-4-26B-A4B chat delivered 100.5 tok/s on the 3090, 47.7 on Strix Vulkan, and 43.7 on Strix ROCm. Qwen3.6-27B chat ran at 21.1 tok/s on the 3090 versus 11.6 on Strix Vulkan and 11.2 on Strix ROCm.
- 02. Strix Vulkan edges out Strix ROCm on the same hardware. Gemma-4-26B-A4B showed the biggest gap at +9% (43.7 → 47.7 tok/s); most other models are within a few percent. The likely cause is gfx1151 kernel tuning differences in the bundled ROCm build.
- 03. Quantization cost on the 3090 is narrower than expected. Qwen3.6-27B chat ranged from 24.0 tok/s at Q2_K down to 15.3 at Q6_K, a 1.6x spread. Q4_K_M at 21.1 tok/s is the sweet spot: Q2_K buys about 14% over Q4_K_M, while Q6_K costs about 28% in exchange for the quality bump.
- 04. Reasoning models hide most output. Qwen3.5 and Qwen3.6 stream the bulk of their decode through a hidden reasoning channel that counts toward the decode rate but never reaches the user. Visible output tok/s therefore looks ~5x slower than the model's actual generation speed, a critical detail when picking a coding assistant.
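The relationship in the last item between the hidden share of output and the apparent slowdown is simple arithmetic, sketched below. The 80% hidden fraction comes from the headline. Applying it to the 3090's 21.1 tok/s Qwen3.6-27B figure is an illustrative assumption, not a measured visible rate, and both function names are hypothetical.

```python
def visible_tok_s(decode_tok_s: float, hidden_fraction: float) -> float:
    """Throughput the user actually sees, given the share of tokens hidden."""
    if not 0.0 <= hidden_fraction < 1.0:
        raise ValueError("hidden_fraction must be in [0, 1)")
    return decode_tok_s * (1.0 - hidden_fraction)


def apparent_slowdown(hidden_fraction: float) -> float:
    """Factor by which visible throughput trails the total decode rate."""
    if not 0.0 <= hidden_fraction < 1.0:
        raise ValueError("hidden_fraction must be in [0, 1)")
    return 1.0 / (1.0 - hidden_fraction)


# With 80% of tokens hidden, visible output trails the decode rate by ~5x.
print(f"slowdown: {apparent_slowdown(0.8):.1f}x")

# Illustration only: 21.1 tok/s total decode would surface as about 4.2 tok/s.
print(f"visible: {visible_tok_s(21.1, 0.8):.1f} tok/s")
```

When comparing coding assistants, the visible rate is the one that determines how long a user waits for an answer, so it is worth computing both figures from the same run.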