ExLlamaV3 triples coding speed with DFlash attention in five-week sprint
Turboderp's ExLlamaV3 inference engine shipped five releases from late April through May 9, adding Gemma 4 support, DFlash attention, and per-model kernel tuning that together triple coding throughput and lift throughput on some quantized models by up to 72 percent.

ExLlamaV3, turboderp's open-source inference engine for running quantized models locally, has shipped five releases since late April, culminating in DFlash attention support and per-model kernel tuning that together triple coding throughput and accelerate inference across consumer and workstation GPUs.
The sprint began April 29 with Gemma 4 architecture support. Two weeks later, v0.0.31 landed DFlash attention, a technique that optimizes key-value cache access during generation. On coding workloads, DFlash delivered 3× faster throughput than baseline—177.67 tokens/second versus 59.21 t/s. Agentic code generation hit 2.51× speedup, while creative writing and reasoning tasks saw 1.5–1.58× gains. Translation with reasoning jumped 2.06×.
| Category | Baseline | DFlash | Speedup |
|---|---|---|---|
| Coding | 59.21 t/s | 177.67 t/s | 3.00× |
| Agentic, code | 55.98 t/s | 140.61 t/s | 2.51× |
| Agentic, curl | 54.03 t/s | 125.94 t/s | 2.33× |
| Translation (reasoning) | 58.08 t/s | 119.43 t/s | 2.06× |
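
The speedup column follows directly from the throughput figures. As a quick sanity check, a short script (numbers copied from the table above) reproduces each ratio:

```python
# Recompute the speedup column of the DFlash benchmark table.
# Throughput values (tokens/second) are taken verbatim from the table above.
baseline = {
    "Coding": 59.21,
    "Agentic, code": 55.98,
    "Agentic, curl": 54.03,
    "Translation (reasoning)": 58.08,
}
dflash = {
    "Coding": 177.67,
    "Agentic, code": 140.61,
    "Agentic, curl": 125.94,
    "Translation (reasoning)": 119.43,
}

for task, base_tps in baseline.items():
    speedup = dflash[task] / base_tps
    print(f"{task}: {speedup:.2f}x")
```

Each printed ratio matches the table to two decimal places (3.00×, 2.51×, 2.33×, 2.06×), so the reported speedups are internally consistent with the raw throughput numbers.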