ExLlamaV3 triples coding speed with DFlash attention in five-week sprint
Turboderp's ExLlamaV3 inference engine shipped five releases from late April through May 9, adding Gemma 4 support, DFlash attention, and per-model kernel tuning that together triple coding throughput and lift throughput on some quantized models by up to 72 percent.

ExLlamaV3, turboderp's open-source inference engine for running quantized models locally, has shipped five releases since late April, culminating in DFlash attention support and per-model kernel tuning that together triple coding throughput and accelerate inference across consumer and workstation GPUs.
The sprint began April 29 with Gemma 4 architecture support. Two weeks later, v0.0.31 landed DFlash attention, a technique that optimizes key-value cache access during generation. On coding workloads, DFlash delivered 3× faster throughput than baseline—177.67 tokens/second versus 59.21 t/s. Agentic code generation hit 2.51× speedup, while creative writing and reasoning tasks saw 1.5–1.58× gains. Translation with reasoning jumped 2.06×.
| Category | Baseline | DFlash | Speedup |
|---|---|---|---|
| Coding | 59.21 t/s | 177.67 t/s | 3.00× |
| Agentic, code | 55.98 t/s | 140.61 t/s | 2.51× |
| Agentic, curl | 54.03 t/s | 125.94 t/s | 2.33× |
| Translation (reasoning) | 58.08 t/s | 119.43 t/s | 2.06× |
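
The speedup column follows directly from the throughput figures. As a quick sanity check, a short script (numbers copied from the table above) reproduces each ratio:

```python
# Recompute the speedup column of the DFlash benchmark table.
# Throughput values (tokens/second) are taken verbatim from the table above.
baseline = {
    "Coding": 59.21,
    "Agentic, code": 55.98,
    "Agentic, curl": 54.03,
    "Translation (reasoning)": 58.08,
}
dflash = {
    "Coding": 177.67,
    "Agentic, code": 140.61,
    "Agentic, curl": 125.94,
    "Translation (reasoning)": 119.43,
}

for task, base_tps in baseline.items():
    speedup = dflash[task] / base_tps
    print(f"{task}: {speedup:.2f}x")
```

Each printed ratio matches the table to two decimal places (3.00×, 2.51×, 2.33×, 2.06×), so the reported speedups are internally consistent with the raw throughput numbers.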