VaSE cuts reasoning model KV cache memory 4× while boosting accuracy
New training-free method protects high-magnitude value states and adds stochasticity to KV cache eviction, outperforming the strongest prior eviction method by more than 4 percentage points on reasoning tasks while enabling a static memory footprint.

Value-aware Stochastic KV Cache Eviction (VaSE), a training-free technique from researchers at USC, UW, and Salesforce, compresses KV cache memory by 4× in reasoning models without sacrificing accuracy. Posted to arXiv on June 3, the preprint shows Qwen3 models using VaSE beat state-of-the-art selection methods at the same sparsity level and outperform the strongest existing eviction approach by more than 4 percentage points across six reasoning tasks.
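To give a rough sense of what a 4× reduction means in bytes, the sketch below computes KV cache size from the standard formula (keys plus values, per layer, per KV head, per token). The model dimensions are hypothetical round numbers chosen for illustration, not figures from the preprint, and the sketch assumes that 4× compression corresponds to retaining one quarter of the cached tokens.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Size of a KV cache in bytes; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical configuration (assumed for illustration, not taken from the preprint).
LAYERS, KV_HEADS, HEAD_DIM, TOKENS = 36, 8, 128, 32768

full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS)
# Assumption: 4x compression means keeping a quarter of the cached tokens.
compressed = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS // 4)

print(f"full cache:       {full / 2**30:.3f} GiB")
print(f"compressed cache: {compressed / 2**30:.3f} GiB")
```

Because the size grows linearly with the number of cached tokens, capping that number is what makes a fixed memory budget possible for a single long reasoning trace.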
Reasoning models generate long chains of thought to improve accuracy, but those extended outputs create a memory bottleneck. Traditional KV cache eviction methods drop key-value pairs to save memory, yet they typically underperform selection-based sparse attention, which keeps the full cache and attends to only part of it. The VaSE authors identified two failure modes. First, a small fraction of value states have abnormally large magnitudes, and evicting them sends models into repetitive reasoning loops. Second, deterministic eviction policies reduce cache diversity, which hurts accuracy. VaSE addresses both by protecting large-magnitude value states and introducing stochasticity into eviction decisions. The method works with FlashAttention2 and enables a static memory footprint, making it practical for deployment.
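The two ideas, protecting large-magnitude value states and randomizing the remaining eviction choices, can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name, the `protect_frac` and `temperature` parameters, the per-token importance scores, and the use of Gumbel noise to randomize the ranking are all assumptions made for illustration.

```python
import math
import random


def select_keep_indices(scores, value_norms, budget,
                        protect_frac=0.02, temperature=1.0, rng=None):
    """Pick which cache entries to keep (hypothetical sketch, not VaSE itself).

    scores      -- per-token importance scores (assumed given, e.g. attention-based)
    value_norms -- per-token magnitude of the value state
    budget      -- fixed number of entries the cache may hold
    """
    rng = rng or random.Random()
    n = len(scores)
    if n <= budget:
        return list(range(n))

    # Step 1: always keep the entries with the largest value-state magnitudes.
    n_protect = min(budget, max(1, math.ceil(protect_frac * n)))
    by_norm = sorted(range(n), key=lambda i: value_norms[i], reverse=True)
    protected = set(by_norm[:n_protect])

    # Step 2: rank the rest by score plus Gumbel noise, so the choice is
    # stochastic rather than a deterministic top-k.
    def noisy(i):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        return scores[i] / temperature - math.log(-math.log(u))

    candidates = [i for i in range(n) if i not in protected]
    noisy_scores = {i: noisy(i) for i in candidates}
    ranked = sorted(candidates, key=noisy_scores.get, reverse=True)
    chosen = ranked[: budget - n_protect]

    return sorted(protected | set(chosen))


# Usage: 64 cached tokens, one with an outlier value norm, budget of 16.
rng = random.Random(0)
scores = [rng.random() for _ in range(64)]
value_norms = [1.0] * 64
value_norms[5] = 100.0
keep = select_keep_indices(scores, value_norms, budget=16, rng=rng)
print(len(keep), 5 in keep)
```

The returned list never exceeds `budget`, which is the property that keeps the footprint static in this sketch, and the outlier entry survives regardless of its importance score.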
On Qwen3 models, VaSE achieved higher average accuracy than the leading selection method at identical sparsity levels while delivering a 4× compression ratio. The recipe requires no training or fine-tuning, so practitioners can apply it to existing checkpoints immediately. The preprint does not specify whether VaSE has been tested on model families beyond Qwen3, or how it performs on multimodal reasoning tasks that mix text and vision. Watch for follow-up benchmarks on the Llama, Mistral, and Phi families, and for integration into inference engines such as vLLM and TensorRT-LLM.



