Gated DeltaNet-2 decouples memory erase and write for linear-time long-context retrieval
NVIDIA researchers released Gated DeltaNet-2, a linear recurrent attention architecture that separates memory erase and write operations, matching transformer long-context performance at linear compute cost.

Gated DeltaNet-2, a linear recurrent attention mechanism from NVIDIA, splits memory updates into independent erase and write gates. The architecture, detailed in a preprint published this week on arXiv, targets a bottleneck in earlier linear attention designs: a single scalar gate rigidly couples forgetting old associations to writing new ones.
The model uses channel-wise erase gates operating on keys and channel-wise write gates operating on values. NVIDIA researchers derived a chunkwise parallel training form that integrates per-channel decay into asymmetric rank-one erase factors, and implemented it in custom Triton kernels. At 1.3B parameters, pretrained on 100B tokens of FineWeb-Edu, Gated DeltaNet-2 is reported to achieve state-of-the-art results in language modeling, commonsense reasoning, and multi-needle long-context retrieval, and its GPU training throughput stays stable at context lengths up to 16K tokens.
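
To make the erase/write split concrete, here is a minimal sketch of a single memory update in plain Python. It is an illustration under stated assumptions, not the paper's recurrence: the decoupled form `S <- S (I - (erase*k) k^T) + (write*v) k^T` is one plausible reading of "channel-wise erase gates on keys, channel-wise write gates on values", per-channel decay is left out, and the names `coupled_step`, `decoupled_step`, `erase`, and `write` are hypothetical.

```python
# Toy memory update. The state S is a d_v x d_k matrix stored as a list of rows.
# Both update rules below are illustrative assumptions, not the published equations.

def coupled_step(S, k, v, beta):
    """Scalar-coupled delta rule: S <- S (I - beta k k^T) + beta v k^T."""
    d_v, d_k = len(S), len(k)
    # S k: what the memory currently returns for key k.
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    return [[S[i][j] - beta * Sk[i] * k[j] + beta * v[i] * k[j]
             for j in range(d_k)] for i in range(d_v)]


def decoupled_step(S, k, v, erase, write):
    """Assumed decoupled form: S <- S (I - (erase*k) k^T) + (write*v) k^T.

    erase holds one gate per key channel, write holds one gate per value channel.
    """
    d_v, d_k = len(S), len(k)
    ek = [erase[j] * k[j] for j in range(d_k)]
    # S (erase * k): the part of the state selected for removal.
    Sek = [sum(S[i][j] * ek[j] for j in range(d_k)) for i in range(d_v)]
    return [[S[i][j] - Sek[i] * k[j] + write[i] * v[i] * k[j]
             for j in range(d_k)] for i in range(d_v)]


if __name__ == "__main__":
    S = [[1.0, 2.0], [3.0, 4.0]]
    k, v, beta = [0.6, 0.8], [2.0, -1.0], 0.5

    a = coupled_step(S, k, v, beta)
    # Tying every gate to the same scalar recovers the coupled rule.
    b = decoupled_step(S, k, v, erase=[beta, beta], write=[beta, beta])

    max_diff = max(abs(a[i][j] - b[i][j]) for i in range(2) for j in range(2))
    assert max_diff < 1e-12
    print("tied gates reproduce the coupled update, max diff:", max_diff)
```

In this toy form, the coupled rule is the special case where every erase and write channel shares one value; letting the two gate vectors differ is what allows forgetting and writing to vary independently.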
What stands out
- 01 Independent erase and write operations. Earlier linear recurrent models tied forgetting and writing through a single scalar gate. Gated DeltaNet-2 decouples these into separate per-channel gates, one for keys (erase) and one for values (write), reducing memory interference at a fixed hidden state size.
- 02 Chunkwise parallel training. The team derived a chunkwise formulation that allows block-parallel training without sacrificing the recurrent structure. Asymmetric rank-one erase factors fold per-channel decay directly into the parallel computation, enabling efficient GPU utilization on long sequences.
- 03 Stable throughput at scale. Training speed remains nearly constant as context length grows from 2K to 16K tokens, in contrast to quadratic attention, where throughput drops as sequences lengthen. The custom Triton kernels take advantage of the architecture's linear scaling in sequence length.
- 04 Multi-needle retrieval. On multi-needle tasks, which require retrieving multiple facts scattered across long documents, Gated DeltaNet-2 matches standard transformer performance while maintaining O(n) compute. The decoupled gates let the model overwrite stale associations without destroying useful long-range state.
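
The overwrite behavior in the last item can be illustrated with a two-key toy memory. This is a sketch under assumptions, not the paper's method: the update `S <- S (I - (erase*k) k^T) + (write*v) k^T` is an assumed simplification with no decay term, the keys are chosen orthonormal so the arithmetic is exact, and `decoupled_step` and `read` are hypothetical helper names.

```python
# Toy demonstration: overwrite one association while another survives.
# The update rule is an assumed simplification, not the published recurrence.

def decoupled_step(S, k, v, erase, write):
    """Assumed form: S <- S (I - (erase*k) k^T) + (write*v) k^T."""
    d_v, d_k = len(S), len(k)
    ek = [erase[j] * k[j] for j in range(d_k)]
    Sek = [sum(S[i][j] * ek[j] for j in range(d_k)) for i in range(d_v)]
    return [[S[i][j] - Sek[i] * k[j] + write[i] * v[i] * k[j]
             for j in range(d_k)] for i in range(d_v)]


def read(S, q):
    """Query the memory: returns S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


ON, OFF = [1.0, 1.0], [0.0, 0.0]
k1, k2 = [1.0, 0.0], [0.0, 1.0]          # orthonormal keys

# Store two facts: k1 -> [1, 2] and k2 -> [3, 4].
S = [[0.0, 0.0], [0.0, 0.0]]
S = decoupled_step(S, k1, [1.0, 2.0], erase=ON, write=ON)
S = decoupled_step(S, k2, [3.0, 4.0], erase=ON, write=ON)

# Write-only update (erase gate closed): the new value piles onto the old one.
additive = decoupled_step(S, k1, [5.0, 6.0], erase=OFF, write=ON)
print(read(additive, k1))   # [6.0, 8.0]  (stale [1, 2] still mixed in)

# Erase and write together: the stale value is replaced cleanly.
updated = decoupled_step(S, k1, [5.0, 6.0], erase=ON, write=ON)
print(read(updated, k1))    # [5.0, 6.0]
print(read(updated, k2))    # [3.0, 4.0]  (the other fact is untouched)

# Erase-only update (write gate closed): forget k1 without storing anything new.
erased = decoupled_step(updated, k1, [9.0, 9.0], erase=ON, write=OFF)
print(read(erased, k1))     # [0.0, 0.0]
print(read(erased, k2))     # [3.0, 4.0]
```

A single scalar gate cannot express the erase-only or write-only cases without also scaling the other operation, which is the interference the separate gates are meant to avoid. Real keys are not orthonormal, so actual models only approximate this clean behavior.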

