OScaR compresses LLM Key-Value cache to 2 bits with 3× speedup

A new preprint shows that rotating and scaling token norms before quantization lets LLMs compress their Key-Value cache to 2-bit integers with near-lossless accuracy across text and multimodal models.

ByAlex Sokoloff·May 18, 2026

OScaR compresses LLM Key-Value cache to 2 bits with 3× speedup

Extreme KV cache quantization has long traded accuracy for memory savings, but a new technique pushes past that boundary. OScaR (Omni-Scaled Canalized Rotation), presented in a preprint released May 20, compresses the Key-Value cache in large language models down to INT2—2 bits per parameter—while preserving near-lossless performance across text-only, multimodal, and omni-modal architectures.

The paper identifies Token Norm Imbalance (TNI) as the core bottleneck that breaks existing per-channel quantization at low bit-widths. When token groups in a sequence have wildly different norms, shared quantization parameters amplify errors across the sequence. OScaR addresses this by applying a canalized rotation to the Key and Value tensors, then scaling each token's norm individually before quantizing. The authors argue this two-step preprocessing is simpler and faster than prior pipelines like TurboQuant, which rely on iterative search or mixed-precision schemes.

Benchmarked against BF16 FlashDecoding-v2, OScaR's INT2 implementation delivers 3.0× faster decoding, cuts memory footprint by 5.3×, and lifts throughput by 4.1×. The preprint shows the method working across text LLMs and multimodal models without architecture-specific tuning. Optimized CUDA kernels and system-level integration are included in the open-source release.

Long-context and multimodal inference push KV cache sizes into the tens of gigabytes, making memory a hard constraint. Standard per-channel quantization handles channel-wise outliers in Key tensors but collapses under 2-bit compression. OScaR's rotation-and-scaling step redistributes variance so that INT2 parameters can span the full token sequence without clipping high-norm tokens or underutilizing low-norm ones. The authors position it as a universal framework that sets a new Pareto front for accuracy versus complexity in KV cache quantization.

ZenCreator

OScaR compresses LLM Key-Value cache to 2 bits with 3× speedup

More in Releases

Avito launches year-long Data Science Bootcamp with ML and NLP tracks

HuggingFace Jobs launches one-command vLLM deployment on H100 and A100

Gemma 4 voice AI hits sub-100ms latency on Cerebras wafer-scale chips

Hugging Face embeds 200+ benchmark scores directly on model cards

NVIDIA NeMo AutoModel cuts fine-tuning setup time for Llama, Mistral, Gemma