Key-Value Means cuts transformer memory with O(N) chunked RNN block-recurrence
A new attention mechanism called KVM offers subquadratic prefill time and sublinear state growth without custom kernels, bridging transformers and linear RNNs.

Key-Value Means is a block-recurrence attention mechanism from Daniel Goldstein and Eugene Cheah that runs with either a fixed-size or a growable state. The preprint, posted to arXiv on May 12, 2026, describes a transformer architecture that achieves subquadratic prefill time and sublinear memory growth on long-context tasks while still supporting chunk-wise parallelizable training, striking a middle ground between traditional quadratic attention and pure linear RNNs, which sit at opposite ends of that trade-off. KVM layers can replace standard attention on every layer of a model, cutting KV-cache memory and allowing practitioners to tune prefill complexity anywhere between O(N) and O(N²).
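The paper defines the exact update rule; as a rough illustration of the chunked idea, the sketch below simply takes the name literally and is not the authors' algorithm. It assumes that each finished chunk is compressed into mean key and value vectors that later chunks attend to alongside their own exact keys and values; the function name kvm_chunked_attention, the chunk size, and the mean-pooling compression are all assumptions made for illustration.

```python
# Illustrative only: chunk-wise causal attention where past chunks are
# summarized by their key/value means (a literal reading of "Key-Value Means",
# not the paper's exact update rule).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kvm_chunked_attention(q, k, v, chunk=64):
    """q, k, v: (N, d) arrays for one head; processed chunk by chunk."""
    n, d = q.shape
    out = np.empty_like(v)
    mean_k, mean_v = [], []  # growable cache: one summary vector per finished chunk
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        qc, kc, vc = q[start:end], k[start:end], v[start:end]
        m = len(mean_k)  # number of past-chunk summaries
        keys = np.concatenate([np.stack(mean_k), kc]) if m else kc
        vals = np.concatenate([np.stack(mean_v), vc]) if m else vc
        scores = qc @ keys.T / np.sqrt(d)
        # Causal mask only inside the current chunk; summaries are entirely past.
        mask = np.triu(np.ones((end - start, end - start), dtype=bool), k=1)
        scores[:, m:][mask] = -1e30
        out[start:end] = softmax(scores) @ vals
        # Compress the finished chunk to its key/value means and grow the cache.
        mean_k.append(kc.mean(axis=0))
        mean_v.append(vc.mean(axis=0))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 32))
print(kvm_chunked_attention(x, x, x, chunk=64).shape)  # (512, 32)
```

In this toy version each chunk attends to its own tokens plus one summary per earlier chunk rather than to every earlier token, which is what keeps prefill below quadratic; capping the cache at a constant number of summary slots instead of letting it grow would correspond to the fixed-size, O(N) regime the authors describe.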
The authors trained a transformer with a growable KVM cache and report competitive performance on long-context benchmarks. Because KVM uses standard operations and needs no custom kernels, it slots into existing codebases without low-level rewrites. The mechanism can also run in hybrid mode alongside linear RNN layers, giving models expandable context memory and improved decoding on long sequences. Goldstein and Cheah note that equipping a strong transformer baseline with fixed-size KVM layers produces a performant O(N) chunked RNN while adding only a negligible number of parameters.
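The difference between the two regimes is easiest to see with some back-of-the-envelope counting. The numbers below are illustrative assumptions rather than figures from the paper: a standard KV cache stores two vectors per token, a fixed-size KVM state stores a constant number of summary slots, and a growable cache can keep its summary count sublinear if chunk lengths grow over time (a square-root schedule is used here purely as an example, not as the paper's schedule).

```python
# Back-of-the-envelope cache sizes (vectors stored per head); illustrative
# assumptions only, not numbers or schedules from the paper.
import math

def standard_kv_cache(n):          # quadratic attention: one K and one V per token
    return 2 * n

def fixed_kvm_state(n, slots=64):  # fixed-size KVM state: constant, O(N) decode
    return 2 * slots

def growable_kvm_cache(n):         # hypothetical schedule keeping ~sqrt(N) summaries
    return 2 * math.isqrt(n)

print(f"{'tokens':>10} {'KV cache':>10} {'fixed KVM':>10} {'growable KVM':>13}")
for n in (4_096, 65_536, 1_048_576):
    print(f"{n:>10} {standard_kv_cache(n):>10} {fixed_kvm_state(n):>10} "
          f"{growable_kvm_cache(n):>13}")
```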
Code and trained models are available under the Apache 2.0 license at github.com/recursal/KVM-paper and huggingface.co/collections/recursal/key-value-means. The preprint is arXiv:2605.09877.