Attention-state memory cuts long-prefix latency 1.36× in LLaMA-3.1-8B inference

Researchers propose a training-free method that externalizes long conditioning prefixes into precomputed lookup tables, improving accuracy and speed over standard in-context learning.

May 18, 2026

A new preprint introduces attention-state memory, a training-free technique that speeds up long-context generation in large language models by replacing repeated attention over conditioning prefixes with lightweight lookup tables. The method addresses two structural bottlenecks in prefix-augmented inference: the prefix's influence fades as generation proceeds, and attention computation over the prefix scales linearly with its length.
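To make the linear-scaling bottleneck concrete, here is a rough back-of-the-envelope sketch, not taken from the preprint. It counts multiply-adds for a single query attending to a prefix in one attention head and ignores softmax, batching, and every other part of the model. The function name `prefix_attention_macs` and the head dimension of 128 are illustrative assumptions.

```python
def prefix_attention_macs(prefix_len, head_dim):
    """Approximate multiply-adds for one query attending to a prefix, one head.

    prefix_len * head_dim for the query-key scores, plus the same amount
    again for the weighted sum over the value vectors.
    """
    return 2 * prefix_len * head_dim


HEAD_DIM = 128  # illustrative value, an assumption of this sketch

# Each doubling of the prefix doubles the per-token attention work.
for prefix_len in (1024, 2048, 4096, 8192):
    print(prefix_len, prefix_attention_macs(prefix_len, HEAD_DIM))
```

Because this cost is paid again for every generated token, a longer prefix raises the price of each decoding step, not just the one-time cost of reading the prompt.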

The approach precomputes attention states between prefix and query tokens, then stores them in an external memory that the model queries at inference time instead of rerunning attention over the full prefix. On ManyICLBench with LLaMA-3.1-8B, attention-state memory improves accuracy over standard in-context learning at memory budgets between 1K and 8K tokens, and speeds up attention by 1.36× at the 8K mark. On the NBA benchmark, the method outperforms full-attention retrieval-augmented generation while using only 20 percent of its memory footprint.
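The general idea of caching attention over a fixed prefix can be illustrated with a toy sketch. This is not the preprint's algorithm, whose table construction and lookup rule are not detailed here. The sketch assumes a hypothetical set of "probe" queries chosen offline, a nearest-neighbor lookup, and the standard log-sum-exp identity for merging softmax attention computed over two disjoint sets of keys. All function names are invented for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def attend(query, keys, values):
    """Softmax attention for one query: returns (output, log-sum-exp of scores)."""
    scores = [dot(query, k) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out = [sum(w * v[i] for w, v in zip(weights, values)) / total
           for i in range(len(values[0]))]
    return out, peak + math.log(total)


def build_memory(probe_queries, prefix_keys, prefix_values):
    """Offline step (hypothetical): cache the prefix attention state per probe."""
    return [(q, attend(q, prefix_keys, prefix_values)) for q in probe_queries]


def lookup(memory, query):
    """Online step: reuse the cached state of the nearest probe query."""
    _, state = min(memory, key=lambda entry: sq_dist(entry[0], query))
    return state


def merge(state_a, state_b):
    """Combine attention states over two disjoint key sets into one state."""
    (out_a, lse_a), (out_b, lse_b) = state_a, state_b
    peak = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - peak), math.exp(lse_b - peak)
    out = [(w_a * x + w_b * y) / (w_a + w_b) for x, y in zip(out_a, out_b)]
    return out, peak + math.log(w_a + w_b)


# Toy data: a 3-token prefix, 1 local token, and 2 probe queries.
prefix_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prefix_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
local_keys = [[0.5, -0.5]]
local_values = [[7.0, 8.0]]
memory = build_memory([[1.0, 0.0], [0.0, 1.0]], prefix_keys, prefix_values)

query = [1.0, 0.0]
# Lookup plus local attention, with no pass over the prefix at query time.
merged, _ = merge(lookup(memory, query), attend(query, local_keys, local_values))
# Reference: ordinary attention over prefix and local tokens together.
full, _ = attend(query, prefix_keys + local_keys, prefix_values + local_values)
```

In this toy, `merged` matches `full` whenever the query coincides with a stored probe. For any other query the lookup is an approximation, and its quality depends on how well the probes cover the queries seen at inference.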

Because the technique requires no gradient-based training, it sidesteps the training cost and inflexibility of methods that internalize prefixes into model parameters. It also avoids the linear scaling penalty of compression schemes that still attend to the prefix at inference. The preprint does not specify whether the lookup tables degrade when prefix content changes frequently, or how the method behaves with prefixes longer than 8K tokens. Practitioners running long-prompt workloads on LLaMA or similar models will want to watch for reference implementations and for ablation studies at context lengths beyond the 8K ceiling tested here.
