
Gated DeltaNet-2 decouples memory erase and write for linear-time long-context retrieval | UncensoredHub


NVIDIA researchers released Gated DeltaNet-2, a linear recurrent attention architecture that separates memory erase and write operations, matching transformer long-context performance at linear compute cost.

By Alex Sokoloff · May 28, 2026


Gated DeltaNet-2, a linear recurrent attention mechanism from NVIDIA, splits memory updates into independent erase and write gates. The architecture, detailed in a preprint published this week on arXiv, targets a bottleneck that has limited earlier linear attention designs: a single scalar gate rigidly couples forgetting old associations to writing new ones.

The model uses channel-wise erase gates operating on keys and channel-wise write gates operating on values. NVIDIA researchers derived a chunkwise parallel training form that integrates per-channel decay into asymmetric rank-one erase factors, implemented in custom Triton kernels. At 1.3B parameters, pretrained on 100B tokens of FineWeb-Edu, Gated DeltaNet-2 reportedly achieves state-of-the-art results in language modeling, commonsense reasoning, and multi-needle retrieval from long context, with GPU training throughput that stays stable at context lengths up to 16K tokens.
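To make the erase and write split concrete, here is a minimal NumPy sketch of a single recurrent update. It assumes a delta-rule style memory in which the state is a matrix mapping keys to values. The function name `erase_write_step`, the exact update formula, and the toy dimensions are illustrative assumptions, not the formulation from the preprint, which may differ in detail.

```python
import numpy as np

def erase_write_step(S, k, v, erase, write):
    """One recurrent update of a (d_v, d_k) associative memory S.

    Hypothetical sketch, not the preprint's exact rule.
    erase: per-key-channel gate in [0, 1], shape (d_k,)
    write: per-value-channel gate in [0, 1], shape (d_v,)
    """
    # Erase: subtract what S currently returns for k, scaled per key channel.
    # Equivalent to S @ (I - outer(k, erase * k)), an asymmetric rank-one factor.
    S = S - np.outer(S @ k, erase * k)
    # Write: add the new association, scaled per value channel.
    return S + np.outer(write * v, k)

d_k, d_v = 4, 3
keys = np.eye(d_k)[:2]                      # two orthonormal keys
S = np.zeros((d_v, d_k))
full_erase, full_write = np.ones(d_k), np.ones(d_v)

S = erase_write_step(S, keys[0], np.array([1.0, 2.0, 3.0]), full_erase, full_write)
S = erase_write_step(S, keys[1], np.array([4.0, 5.0, 6.0]), full_erase, full_write)
# Overwrite the first association; the second one is left untouched.
S = erase_write_step(S, keys[0], np.array([7.0, 8.0, 9.0]), full_erase, full_write)

recall_0 = S @ keys[0]   # value now stored under the first key
recall_1 = S @ keys[1]   # value still stored under the second key
```

Because the two gates are separate arguments, the sketch can erase an association without writing anything (set `write` to zeros) or write without erasing (set `erase` to zeros), which a single shared scalar gate cannot express.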

What stands out

01 Independent erase and write operations. Earlier linear recurrent models tied forgetting and writing through a single scalar gate. Gated DeltaNet-2 decouples these into separate per-channel gates, one for keys (erase) and one for values (write), reducing memory interference at a fixed hidden state size.
02 Chunkwise parallel training. The team derived a mathematical form that allows block-parallel training without sacrificing the recurrent structure. Asymmetric rank-one erase factors fold per-channel decay directly into the parallel computation, enabling efficient GPU utilization on long sequences.
03 Stable throughput at scale. Training speed remains nearly constant as context length grows from 2K to 16K tokens, in contrast to quadratic attention, where throughput collapses. The custom Triton kernels take advantage of the architecture's linear complexity.
04 Transformer-level multi-needle retrieval. On multi-needle tasks, which require retrieving multiple facts scattered across long documents, Gated DeltaNet-2 matches standard transformer performance while maintaining O(n) compute. The decoupled gates let the model overwrite stale associations without destroying useful long-range state.
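The gap between linear and quadratic scaling can be illustrated with rough arithmetic. The sketch below counts approximate multiply-adds per token for one attention head versus one fixed-size recurrent state. The helper names, the head dimension of 128, and the constant factors are assumptions for illustration only; the counts ignore projections, memory traffic, and kernel efficiency, so they say nothing about measured throughput.

```python
def attention_ops_per_token(context_len, head_dim):
    # Each new token scores against every earlier key, then mixes their values.
    return 2 * context_len * head_dim

def recurrent_ops_per_token(key_dim, value_dim):
    # Erase, write, and read-out each touch the key_dim x value_dim state once.
    return 3 * key_dim * value_dim

HEAD_DIM = 128  # assumed for illustration, not taken from the preprint

for n in (2048, 4096, 8192, 16384):
    attn = attention_ops_per_token(n, HEAD_DIM)
    rec = recurrent_ops_per_token(HEAD_DIM, HEAD_DIM)
    print(f"{n:>6} tokens: attention {attn:>9,} ops/token, recurrent {rec:>7,} ops/token")
```

Under these assumptions the attention cost per token grows eightfold between 2K and 16K tokens, while the recurrent cost per token does not depend on context length at all.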
