OCTOPUS KV cache codec beats TurboQuant and PolarQuant across modalities
New preprint from Boss, Voleti, Donné, and Vainer shows joint quantization of rotated coordinate triplets matches or beats prior rotation-based KV compression at every bit width tested.

OCTOPUS, a new KV cache compression codec, quantizes rotated coordinate triplets jointly rather than one coordinate at a time. Researchers Mark Boss, Vikram Voleti, Simon Donné, and Shimon Vainer map each triplet's direction to a square via octahedral parameterization, then apply Lloyd-Max quantization to the two resulting coordinates and the triplet norm. The codec is data-oblivious, online, and deterministic given a seed—it works without training or per-batch calibration.
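To make the pipeline described above concrete, here is a minimal NumPy sketch, not the authors' implementation. It uses the standard octahedral mapping between unit vectors and the square [-1, 1]^2, plus a textbook sample-based Lloyd-Max iteration. The function names, the 4-bit budget per coordinate, the head dimension of 96, and the choice to fit codebooks on synthetic Gaussian triplets (so that no real data is needed) are all assumptions made for illustration. The paper's actual codebooks, bit allocation, and rotation step may differ.

```python
import numpy as np

def sign_not_zero(a):
    # Like np.sign, but maps 0 to +1 so the fold below stays invertible.
    return np.where(a >= 0.0, 1.0, -1.0)

def oct_encode(t):
    """Map triplets of shape (..., 3) to (u, v, r): u, v in [-1, 1], r = norm."""
    r = np.linalg.norm(t, axis=-1)
    l1 = np.abs(t).sum(axis=-1, keepdims=True)
    p = t / np.maximum(l1, 1e-12)  # project onto the octahedron |x|+|y|+|z| = 1
    uv = p[..., :2]
    folded = (1.0 - np.abs(uv[..., ::-1])) * sign_not_zero(uv)
    uv = np.where(p[..., 2:3] < 0.0, folded, uv)  # fold the z < 0 half outward
    return uv[..., 0], uv[..., 1], r

def oct_decode(u, v, r):
    """Inverse of oct_encode: rebuild the direction, then rescale by the norm."""
    z = 1.0 - np.abs(u) - np.abs(v)
    x = np.where(z < 0.0, (1.0 - np.abs(v)) * sign_not_zero(u), u)
    y = np.where(z < 0.0, (1.0 - np.abs(u)) * sign_not_zero(v), v)
    d = np.stack([x, y, z], axis=-1)
    d /= np.maximum(np.linalg.norm(d, axis=-1, keepdims=True), 1e-12)
    return d * r[..., None]

def lloyd_max(samples, bits, iters=50):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    n = 2 ** bits
    levels = np.quantile(samples, (np.arange(n) + 0.5) / n)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2
        idx = np.searchsorted(edges, samples)
        for k in range(n):
            members = samples[idx == k]
            if members.size:
                levels[k] = members.mean()
    return levels

def quantize(x, levels):
    # Index of the nearest level, found via midpoints between sorted levels.
    return np.searchsorted((levels[:-1] + levels[1:]) / 2, x)

rng = np.random.default_rng(0)

# Assumption: codebooks are fit once on synthetic Gaussian triplets, not real keys.
calib = rng.standard_normal((20000, 3))
books = [lloyd_max(c, bits=4) for c in oct_encode(calib)]

# Stand-in for already-rotated keys; 96 is chosen only because it divides by 3.
keys = rng.standard_normal((8, 96))
triplets = keys.reshape(8, -1, 3)
codes = [quantize(c, b) for c, b in zip(oct_encode(triplets), books)]
recon = oct_decode(*[b[i] for b, i in zip(books, codes)]).reshape(keys.shape)
rel_err = np.linalg.norm(recon - keys) / np.linalg.norm(keys)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The sketch stores three small integer codes per triplet (two for direction, one for norm) and rebuilds the triplet from codebook lookups alone. It does not cover how head dimensions that are not multiples of three are handled, nor the non-uniform bit allocation the paper reports.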
Posted to arXiv on May 21, OCTOPUS matches or beats every prior rotation codec—TurboQuant and PolarQuant among them—at every reported bit width and metric across text, video, and audio workloads. The quality lead grows as bit budgets drop, making the codec especially useful for extreme compression scenarios where memory bandwidth is the bottleneck. A fused Triton implementation reconstructs keys on the fly without materializing the uncompressed tensor, so the codec adds no decode-time bandwidth or latency over existing dequantization paths.
The paper finds that the finite-dimensional quality optimum for per-triplet bit allocation is constant across every real decoder tested, a result that simplifies deployment. The non-uniform bit allocation depends only on the total dimensionality of the keys, not on the data distribution, which means the same codec parameters work across modalities. The octahedral parameterization itself is the key structural innovation: it lets the codec exploit the geometry of rotated triplets without the coordinate-by-coordinate independence assumption that limits earlier rotation-preconditioned methods.
The next question is whether the Triton kernel generalizes to other hardware backends and whether the constant bit-allocation result holds for models outside the test set, particularly for very large context windows and for multimodal architectures where key and value dimensions vary widely.