KV cache edits cut p90 time-to-first-token 53–398× while preserving accuracy
New arXiv preprint shows attention caches can be edited and spliced without full recompute, keeping logit cosine similarity to full recompute at 0.90 or higher while cutting latency by up to 14.9× for an edit-plus-compose agent.

A preprint posted to arXiv on June 17 demonstrates that large language models write field-conditioned conclusions into their key-value cache at prefill, enabling direct edits and position-portable composition without invalidating downstream tokens. The technique, validated across twelve models spanning four model families, recovers decisions at 1.00 accuracy in 8B-parameter models using roughly 1 percent of the compute required for full recompute.
Prefix caching today reuses prefill only when the entire prefix matches byte-for-byte; changing a single field, such as a date, a name, or a dollar figure, invalidates every cached token downstream. The paper's causal experiments show that overwriting the field's own key-value vectors leaves the model acting on the old value, because the field-conditioned conclusion has already propagated into later cache positions at prefill. The authors treat the cache as a notebook of memoized conclusions: a salient erratum appended to the cache amends those notes. With chain-of-thought (CoT) prompting, editing the field alone is enough to recover the correct decision; without CoT, the edit is ignored.
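To make the invalidation problem concrete, here is a minimal sketch of a hash-chained block cache, loosely in the spirit of production prefix caches. It is not the paper's code or any library's API: the block size, the token list, and the helper names `block_hashes` and `reusable_blocks` are all hypothetical, chosen only to show why an in-place edit loses every downstream block while an appended erratum keeps them.

```python
import hashlib

BLOCK = 4  # hypothetical block size, in tokens


def block_hashes(tokens, block=BLOCK):
    """Hash each full block together with the hash of everything before it."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for i in range(0, full, block):
        chunk = "\x1f".join(tokens[i:i + block]).encode()
        prev = hashlib.sha256(prev + b"|" + chunk).digest()
        hashes.append(prev)
    return hashes


def reusable_blocks(cached, new):
    """Count the leading blocks of `new` that match the cached chain."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


# Illustrative prompt (assumed data): 12 tokens, so 3 full blocks.
original = ["invoice", "date", "2024-01-05", "amount", "100", "USD",
            "due", "in", "30", "days", "approve", "payment"]
cached = block_hashes(original)

# In-place edit of one field in the first block: the chain breaks immediately.
edited = list(original)
edited[2] = "2024-02-05"
hits_after_edit = reusable_blocks(cached, block_hashes(edited))

# Append-only erratum: the original prefix is untouched, so every block hits.
amended = original + ["erratum:", "date", "is", "2024-02-05"]
hits_after_erratum = reusable_blocks(cached, block_hashes(amended))

print(hits_after_edit, "of", len(cached), "blocks reusable after in-place edit")
print(hits_after_erratum, "of", len(cached), "blocks reusable after erratum")
```

Because each hash covers all preceding blocks, a change near the start of the prompt discards the whole chain, whereas appending leaves the cached prefix intact and only the new tail needs prefill.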
Benchmarks and scope
The approach achieves logit cosine similarity between 0.90 and 0.999 against full recompute across twelve models, including quantized checkpoints, Mixture-of-Experts architectures, and multimodal caches. Time-to-first-token scales as O(L) rather than O(L²), and a unified edit-plus-compose agent stays decision-identical to recompute at up to 14.9× lower latency. In an online vLLM benchmark, the erratum technique maintains a 98.5 percent prefix cache hit rate while cutting p90 time-to-first-token by 53× to 398×.
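For readers unfamiliar with the metric, logit cosine similarity compares the next-token logit vector from the edited cache with the one from a full recompute. The following is a minimal sketch of the standard cosine formula; the function name and the sample vectors are illustrative assumptions, not values from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length logit vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up logits over a 4-token vocabulary (assumed data).
recompute_logits = [2.0, -1.0, 0.5, 3.0]
edited_cache_logits = [1.9, -0.8, 0.6, 3.1]
print(cosine_similarity(recompute_logits, edited_cache_logits))
```

A score of 1.0 means the two vectors point in exactly the same direction. The metric ignores overall scale, so a high score indicates similar relative preferences over tokens rather than identical logits.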
The method applies to any per-token attention KV cache and extends to several attention variants through small adapters. Because the erratum is append-only, it composes with production prefix-caching systems without breaking cache alignment. Precompiled skills can be RoPE-repositioned and spliced into arbitrary contexts, with results the paper describes as indistinguishable from full prefill.
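The repositioning step relies on a general property of rotary position embeddings (RoPE): each pair of key dimensions is rotated by an angle proportional to the token's position, and rotations compose, so a key cached at position p can be moved to position p + Δ by one extra rotation of Δ. The sketch below illustrates that property only. It assumes the interleaved-pair layout and a base of 10000 (some implementations pair dimensions differently), and `rope_rotate` is a hypothetical helper, not the paper's implementation.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by pos * base**(-i/d)."""
    d = len(vec)
    if d % 2:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


# An un-rotated key vector (assumed data).
key = [0.5, -1.0, 2.0, 0.25]

# Key as cached at position 7, then shifted by 5 positions without recompute.
cached_at_7 = rope_rotate(key, 7)
moved_to_12 = rope_rotate(cached_at_7, 5)

# Key computed directly at position 12, as a full prefill would produce.
direct_at_12 = rope_rotate(key, 12)

max_error = max(abs(a - b) for a, b in zip(moved_to_12, direct_at_12))
print(max_error)  # differs only by floating-point rounding
```

In standard RoPE only queries and keys are rotated, so cached value vectors would not need this adjustment. Matching positions is only part of splicing: the cached keys and values still reflect the context they were originally computed in.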



