cyankiwi AWQ 26.05 cuts 4-bit quantization error by 36% on Llama-3.3-70B
A joint optimization approach to AWQ quantization posted the lowest KL divergence against BF16 baselines across three Llama-3 models, beating standard AWQ, GPTQ, and BitsAndBytes NF4.

cyankiwi AWQ 26.05 is a 4-bit quantization method that jointly optimizes per-channel scales and quantization ranges instead of fitting them sequentially. Standard AWQ picks scales first, then quantization parameters, treating them as independent even though rounding error in one depends on the other. The joint-fit approach measured 0.02826 KL divergence on Llama-3.3-70B-Instruct against the BF16 baseline—36% lower than the next-best competitor, unsloth BNB NF4 at 0.04444, and 42% lower than standard AWQ at 0.04859. Lower KLD means the quantized model's output distribution stays closer to the full-precision original.
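
To make the sequential-versus-joint distinction concrete, here is a minimal NumPy sketch of both fitting orders on a single linear layer. It is illustrative only: the activation-magnitude-to-the-power-beta scale rule, the search grids, and the output-MSE objective are common AWQ-style stand-ins, not cyankiwi's published recipe.

```python
# Toy contrast between sequential and joint fitting of AWQ-style channel scales
# and clipping ranges. Assumes one linear layer W (out x in) and calibration
# activations X (tokens x in); grids and the s = act_mag**beta rule are illustrative.
import numpy as np

def quant_dequant_int4(w, clip_ratio):
    """Fake-quantize symmetrically to 4 bits, step size from a clipped per-row max."""
    max_w = np.abs(w).max(axis=1, keepdims=True) * clip_ratio
    step = np.maximum(max_w / 7.0, 1e-8)
    return np.clip(np.round(w / step), -8, 7) * step

def layer_output_error(W, W_q, X):
    """MSE of the layer output on calibration activations."""
    return np.mean((X @ W.T - X @ W_q.T) ** 2)

def awq_like_quantize(W, X, betas, clip_ratios, joint=True):
    """joint=True searches (beta, clip_ratio) together; joint=False fixes the
    channel scale first and fits the clip ratio afterwards (the sequential
    scheme the article attributes to standard AWQ)."""
    act_mag = np.abs(X).mean(axis=0) + 1e-8            # per-input-channel activation magnitude

    def error_for(beta, clip_ratio):
        s = act_mag ** beta                            # AWQ-style activation-aware scale
        W_q = quant_dequant_int4(W * s, clip_ratio) / s
        return layer_output_error(W, W_q, X), s, clip_ratio

    if joint:
        candidates = [error_for(b, c) for b in betas for c in clip_ratios]
    else:
        best_beta = min(betas, key=lambda b: error_for(b, 1.0)[0])
        candidates = [error_for(best_beta, c) for c in clip_ratios]
    err, s, clip_ratio = min(candidates, key=lambda t: t[0])
    return quant_dequant_int4(W * s, clip_ratio) / s, err

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 256)).astype(np.float32)
X = rng.normal(size=(512, 256)).astype(np.float32)
betas, clips = np.linspace(0.0, 1.0, 11), np.linspace(0.7, 1.0, 7)
_, err_seq = awq_like_quantize(W, X, betas, clips, joint=False)
_, err_joint = awq_like_quantize(W, X, betas, clips, joint=True)
print(f"sequential MSE: {err_seq:.6f}  joint MSE: {err_joint:.6f}")
```

The joint grid contains every candidate the sequential search can reach, so its per-layer error can only match or improve on the sequential fit; the KLD numbers below are the end-to-end effect the article reports.
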
The benchmark covered every major 4-bit method: standard AWQ, GPTQ, BitsAndBytes NF4, unsloth's NF4 variant, and NVIDIA's NVFP4. cyankiwi posted the lowest KLD on all three test cases—Llama-3.2-3B-Instruct (0.00510 vs 0.00785 for unsloth NF4), Llama-3.1-8B-Instruct (0.00478 vs 0.00729 for GPTQ), and Llama-3.3-70B-Instruct. The test set was GPQA Diamond responses, a reasoning-heavy eval that amplifies distribution drift.
| Model | cyankiwi AWQ (KLD) | Next-best method (KLD) | Improvement |
|---|---|---|---|
| Llama-3.2-3B-Instruct | 0.00510 | 0.00785 (unsloth NF4) | 35% |
| Llama-3.1-8B-Instruct | 0.00478 | 0.00729 (GPTQ) | 34% |
| Llama-3.3-70B-Instruct | 0.02826 | 0.04444 (unsloth NF4) | 36% |
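
The metric itself is easy to reproduce in principle: run the same evaluation text through the BF16 reference and the quantized checkpoint, then average the per-token KL divergence between their next-token distributions. The sketch below, using Hugging Face transformers, shows that kind of comparison; the quantized model path and input text are placeholders, and the eval data and averaging used for the published numbers may differ.

```python
# Rough sketch of a per-token KLD comparison against a BF16 reference.
# The quantized checkpoint path and the eval text are placeholders; loading a
# 4-bit checkpoint may also require the matching backend (e.g. autoawq) installed.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_name = "meta-llama/Llama-3.1-8B-Instruct"        # BF16 reference
quant_name = "path/to/a-4bit-quantized-llama"        # placeholder

tok = AutoTokenizer.from_pretrained(ref_name)
ref = AutoModelForCausalLM.from_pretrained(ref_name, torch_dtype=torch.bfloat16, device_map="auto")
qnt = AutoModelForCausalLM.from_pretrained(quant_name, device_map="auto")

text = "A sample model response to score goes here."  # e.g. GPQA Diamond answers
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logp_ref = F.log_softmax(ref(ids.to(ref.device)).logits.float(), dim=-1)
    logp_qnt = F.log_softmax(qnt(ids.to(qnt.device)).logits.float(), dim=-1)

# KL(reference || quantized), summed over the vocabulary and averaged over
# token positions: how far the quantized model's predictions drift from BF16.
kld = F.kl_div(logp_qnt, logp_ref.to(logp_qnt.device), log_target=True,
               reduction="none").sum(-1).mean()
print(f"mean per-token KLD: {kld.item():.5f}")
```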