Q-K=V projection merge cuts transformer KV cache 96.9% when paired with GQA
New arXiv paper shows that merging the Key and Value projections in self-attention halves the KV cache while largely preserving accuracy, and reaches a 96.9% total cache reduction when combined with Grouped Query Attention.

A systematic study from Brainchip researchers challenges the assumption that transformers need three independent Q, K, V projections. Posted to arXiv on June 9, the paper tests three projection-sharing strategies and finds that merging the Key and Value matrices (the Q-K=V configuration) cuts KV cache size in half with a minimal accuracy drop. On a 1.2B-parameter model, downstream task performance fell just 0.41 percent compared to the standard three-projection baseline.
The work addresses a core bottleneck in autoregressive decoding: the KV cache. Each token generated during inference attends over the stored Key and Value tensors of every previous token, so the cache grows linearly with context length. For long-context applications and edge deployment, the cache quickly becomes the limiting factor. The authors show mathematically that the Key and Value projections can share weights without breaking the attention mechanism, which removes the need to cache a separate Value tensor.
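To make the idea concrete, here is a minimal single-head decoding sketch in NumPy. It assumes that "Q-K=V" means one shared projection matrix whose output serves as both keys and values; the class name `SharedKVHead`, the weight names, and the toy dimensions are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


class SharedKVHead:
    """One attention head where a single projection acts as both K and V.

    Illustrative only: an assumed reading of the Q-K=V configuration,
    not the authors' implementation.
    """

    def __init__(self, d_model, d_head, seed=0):
        rng = np.random.default_rng(seed)
        scale = d_model ** -0.5
        self.w_q = rng.normal(size=(d_model, d_head)) * scale
        self.w_kv = rng.normal(size=(d_model, d_head)) * scale  # shared K/V weights
        self.cache = np.empty((0, d_head))  # one cached tensor instead of two

    def step(self, x):
        """Process the newest token embedding x of shape (d_model,)."""
        q = x @ self.w_q
        kv = x @ self.w_kv
        self.cache = np.vstack([self.cache, kv])          # append one row per token
        scores = self.cache @ q / np.sqrt(q.shape[-1])    # cache used as keys
        weights = softmax(scores)
        return weights @ self.cache                       # same cache used as values


head = SharedKVHead(d_model=8, d_head=4)
rng = np.random.default_rng(1)
for _ in range(5):
    out = head.step(rng.normal(size=8))
print(head.cache.shape)  # (5, 4): one row per token, a single tensor
```

A standard head would keep two arrays of this shape, one for keys and one for values, which is where the factor-of-two saving comes from.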
The approach is orthogonal to existing head-sharing methods like Grouped Query Attention (GQA) and Multi-Query Attention (MQA), both of which reduce the number of KV heads but still maintain separate K and V projections per head. Stacking Q-K=V on top of GQA or MQA pushes total KV cache reduction to 96.9 percent, a figure that makes multi-million-token inference and smartphone deployment far more practical. The paper includes ablation studies across model scales and confirms that Q-K=V outperforms both the Q=K and Q=K=V variants on standard language modeling benchmarks.
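The arithmetic behind a figure like 96.9 percent can be sketched as follows. The layer count, head counts, head dimension, and sequence length are hypothetical values chosen for illustration. In particular, the 16:1 ratio of query heads to KV heads is an assumption that happens to reproduce the reported number, not a configuration confirmed by the article.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, shared_kv=False):
    """Approximate KV cache size for one sequence.

    shared_kv=True models the Q-K=V case, where one tensor per token
    is cached instead of separate Key and Value tensors.
    """
    tensors_per_token = 1 if shared_kv else 2
    return (n_layers * n_kv_heads * head_dim * seq_len
            * bytes_per_elem * tensors_per_token)


def reduction(baseline, reduced):
    """Fractional saving relative to the baseline cache size."""
    return 1 - reduced / baseline


# Hypothetical model shape, 16-bit cache entries.
cfg = dict(n_layers=24, head_dim=64, seq_len=32_768)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)                      # full multi-head baseline
shared = kv_cache_bytes(n_kv_heads=32, shared_kv=True, **cfg)   # Q-K=V only
gqa = kv_cache_bytes(n_kv_heads=2, **cfg)                       # GQA only, 16:1 grouping (assumed)
both = kv_cache_bytes(n_kv_heads=2, shared_kv=True, **cfg)      # GQA plus Q-K=V

print(f"Q-K=V alone: {reduction(mha, shared):.1%}")  # 50.0%
print(f"GQA alone:   {reduction(mha, gqa):.1%}")     # 93.8%
print(f"Combined:    {reduction(mha, both):.1%}")    # 96.9%
```

Because the two savings multiply (1/16 from head sharing times 1/2 from the merged projection gives 1/32 of the baseline), the combined reduction is 96.875 percent under these assumptions, and it stays the same at any sequence length since both caches scale linearly with it.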
For practitioners building new architectures, the Q-K=V configuration is a drop-in change. The authors tested the design on models ranging from small-scale experiments to billion-parameter checkpoints and found consistent memory savings with negligible accuracy regression. Code is available on GitHub under the Brainchip-Inc organization.






