MELT breaks memory scaling in recurrent language models with shared KV cache
Memory-Efficient Looped Transformer shares a single KV cache per layer across reasoning loops, breaking the linear memory-depth scaling that limits recurrent language models.

Recurrent language models that reason by looping over embeddings—without generating intermediate tokens—hit a memory wall when they scale up reasoning depth. Each iteration adds to a growing key-value cache, forcing memory consumption to climb linearly with the number of reasoning steps. A new preprint proposes Memory-Efficient Looped Transformer (MELT), an architecture that holds memory constant regardless of reasoning depth.
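To make the scaling concrete, here is a back-of-the-envelope comparison in Python. The model dimensions are illustrative only, not from the paper: a standard looped model pays for one KV cache per reasoning loop, while a shared cache is paid for once.

```python
def kv_cache_bytes(layers, seq_len, heads, head_dim, dtype_bytes=2):
    """Size of one full KV cache: keys + values across all layers."""
    return layers * 2 * seq_len * heads * head_dim * dtype_bytes

# Illustrative dimensions only (not taken from the paper).
per_loop = kv_cache_bytes(layers=24, seq_len=1024, heads=16, head_dim=64)

for loops in (1, 4, 16):
    appended = loops * per_loop   # standard looped model: one cache per loop
    shared = per_loop             # shared cache: constant in the loop count
    print(f"{loops:2d} loops: appended {appended / 2**20:6.0f} MiB | "
          f"shared {shared / 2**20:6.0f} MiB")
```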
MELT maintains a single KV cache per layer, shared across all reasoning iterations. A learnable gating mechanism updates that cache in place rather than appending a fresh cache for each loop. The result: memory consumption stays flat as reasoning depth increases, while standard looped models such as Ouro grow linearly. According to the paper, MELT models fine-tuned from pretrained Ouro checkpoints outperform comparably sized standard LLMs on reasoning tasks, with a memory footprint that matches those non-looped baselines and is dramatically smaller than Ouro's.
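The paper's exact gating formulation isn't given in this summary; the sketch below shows one plausible version in PyTorch, where a sigmoid gate computed from the current hidden state blends each loop's new keys and values into a fixed-size shared cache instead of appending to it. The class and parameter names are placeholders, not the authors' code.

```python
import torch
import torch.nn as nn

class GatedSharedKVCache(nn.Module):
    """Sketch of a gated, shared KV cache for one attention layer.

    The cache tensors keep the same shape on every reasoning loop,
    so memory does not grow with reasoning depth.
    """

    def __init__(self, d_model: int, d_kv: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_kv)  # learnable gate

    def forward(self, cache_k, cache_v, new_k, new_v, hidden):
        # hidden: [batch, seq, d_model]; caches and new K/V: [batch, seq, d_kv]
        g = torch.sigmoid(self.gate_proj(hidden))   # per-position update gate
        cache_k = g * new_k + (1.0 - g) * cache_k   # overwrite instead of append
        cache_v = g * new_v + (1.0 - g) * cache_v
        return cache_k, cache_v

# One cache per layer, reused on every reasoning loop:
layer_cache = GatedSharedKVCache(d_model=512, d_kv=64)
b, s = 2, 16
k = v = torch.zeros(b, s, 64)
for _ in range(4):                                  # four reasoning loops
    h = torch.randn(b, s, 512)
    new_k, new_v = torch.randn(b, s, 64), torch.randn(b, s, 64)
    k, v = layer_cache(k, v, new_k, new_v, h)       # shape stays (b, s, 64)
```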
The training procedure combines two post-training phases: an interpolated transition from a LoopLM starting model, followed by attention-aligned distillation. Both phases use chunk-wise training to stabilize learning under the shared-cache design. The authors emphasize that MELT needs only lightweight post-training on top of an existing looped checkpoint, making it a practical retrofit for recurrent architectures already in use.
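Neither phase is spelled out in this summary, so the following is only a rough sketch of the two training-side ideas it mentions: an interpolation coefficient annealed from the original LoopLM behavior toward the fully shared cache, and chunk-wise iteration over a sequence so each step only touches one chunk of cache updates. The function names and the linear schedule are assumptions, not the paper's recipe.

```python
def cache_mix_schedule(step: int, transition_steps: int) -> float:
    """Hypothetical interpolation coefficient for the first post-training phase:
    0.0 keeps the original LoopLM's per-loop caches, 1.0 relies entirely on
    the shared gated cache. A linear ramp is assumed; the paper may differ."""
    return min(1.0, max(0.0, step / max(1, transition_steps)))


def iter_chunks(token_ids, chunk_size=256):
    """Chunk-wise training helper: yield fixed-size slices of a sequence so the
    shared cache is updated, and gradients truncated, one chunk at a time."""
    for start in range(0, len(token_ids), chunk_size):
        yield token_ids[start:start + chunk_size]


# Example: ramp over the first 1,000 steps, then stay fully shared.
assert cache_mix_schedule(0, 1000) == 0.0
assert cache_mix_schedule(1500, 1000) == 1.0
```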
No code or weights have been released yet.