
KV cache edits cut LLM latency 53–398× while preserving accuracy | UncensoredHub


New arXiv preprint shows attention caches can be edited and spliced without full recompute, keeping logit cosine similarity with full recompute at 0.90 or higher while cutting latency by up to 14.9× for a unified edit-plus-compose agent.

By Alex Sokoloff · June 19, 2026


A preprint posted to arXiv on June 17 demonstrates that large language models write field-conditioned conclusions into their key-value cache at prefill, enabling direct edits and position-portable composition without invalidating downstream tokens. The technique, validated across twelve models spanning four model families, recovers decisions at 1.00 accuracy in 8B-parameter models using roughly 1 percent of the compute required for full recompute.

Prefix caching today reuses prefill only when the entire prefix matches byte-for-byte; changing a single field (a date, a name, a dollar figure) invalidates every cached token downstream. The paper's causal experiments show that overwriting only the field's own key-value vectors leaves the model acting on the old value, because the field-conditioned conclusion has already propagated into later cache positions at prefill. The authors treat the cache as a notebook of memoized conclusions: a salient erratum appended to the cache amends those notes. With chain-of-thought (CoT) prompting, editing the field alone recovers the correct decision; without CoT, the edit is ignored.
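The invalidation problem can be illustrated with a toy sketch. It is not the paper's code: the whitespace "tokens", the invoice prompt, and the `shared_prefix_len` helper are all assumptions made for illustration, and a real serving stack would compare tokenizer IDs or block hashes instead. It only shows why an in-place field change loses most of the cached prefix while an appended correction keeps all of it.

```python
def shared_prefix_len(cached, new):
    """Count the leading tokens that two sequences have in common."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


# Hypothetical prompt, split on whitespace as stand-in tokens.
cached = "Invoice date: 2026-06-01 . Amount: 400 USD . Is it overdue ?".split()

# Option A: rewrite the field in place. Every token after the change misses.
in_place = "Invoice date: 2026-07-01 . Amount: 400 USD . Is it overdue ?".split()

# Option B: leave the prefix untouched and append a correction (an "erratum").
erratum = cached + "Correction: the date is 2026-07-01 .".split()

reuse_in_place = shared_prefix_len(cached, in_place)
reuse_erratum = shared_prefix_len(cached, erratum)

print(f"in-place edit reuses {reuse_in_place} of {len(cached)} cached tokens")
print(f"appended erratum reuses {reuse_erratum} of {len(cached)} cached tokens")
```

In this toy case the in-place edit keeps only the tokens before the changed date, while the append-only version keeps the whole cached prefix and pays prefill only for the correction.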

Benchmarks and scope

Across the twelve models, which include quantized checkpoints, Mixture-of-Experts architectures, and multimodal caches, logits produced from the edited cache reach a cosine similarity of 0.90 to 0.999 against full recompute. Time-to-first-token scales as O(L) rather than O(L²), and a unified edit-plus-compose agent stays decision-identical to recompute at up to 14.9× lower latency. In an online vLLM benchmark, the erratum technique maintains a 98.5 percent prefix cache hit rate while cutting p90 time-to-first-token by 53× to 398×.
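For readers unfamiliar with the metric, logit cosine similarity compares the direction of two logit vectors over the same vocabulary. The sketch below is a generic implementation, not the paper's evaluation harness; the four-entry logit vectors are made-up numbers chosen only to show the calculation and the separate "same decision" check.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def argmax(values):
    """Index of the largest entry (the greedy next-token choice)."""
    return max(range(len(values)), key=lambda i: values[i])


# Illustrative logits over a 4-token vocabulary (invented values).
recompute_logits = [2.0, 0.5, -1.0, 0.1]   # from a full prefill
edited_logits = [1.9, 0.6, -1.1, 0.0]      # from an edited cache

sim = cosine_similarity(recompute_logits, edited_logits)
same_decision = argmax(recompute_logits) == argmax(edited_logits)

print(f"cosine similarity: {sim:.4f}, same greedy decision: {same_decision}")
```

A high cosine similarity does not by itself guarantee the same top token, which is presumably why the article reports decision agreement separately from the similarity range.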

The method applies to any per-token attention KV cache and extends to several attention variants through small adapters. Because the erratum is append-only, it composes with production prefix-caching systems without breaking cache alignment. Precompiled skills can be RoPE-repositioned and spliced into arbitrary contexts, with results the paper describes as indistinguishable from full prefill.
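Repositioning relies on a general property of rotary position embeddings (RoPE): rotations compose, so a key cached at one position can be moved by rotating it through the position offset. The sketch below demonstrates only that property on a toy 4-dimensional key, using the common interleaved-pair layout and a base of 10000 as assumptions; it is not the paper's implementation, and real models may lay out the rotated pairs differently.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by pos * theta_i."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # per-pair rotation frequency
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


key = [0.3, -1.2, 0.7, 0.5]               # un-rotated key (invented values)

cached_at_5 = rope(key, 5)                # key as stored in a cache at position 5
moved_to_17 = rope(cached_at_5, 12)       # shift the cached key by an offset of 12
direct_at_17 = rope(key, 17)              # what a fresh prefill would produce

max_err = max(abs(a - b) for a, b in zip(moved_to_17, direct_at_17))
print(f"max abs difference after repositioning: {max_err:.2e}")
```

The difference is at floating-point rounding level, which is why a cached span can in principle be shifted to a new offset without recomputing its keys from scratch.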

