Fully Looped Transformer stabilizes training through 12 iterations
Researchers propose Fully Looped Transformer, a parameter-free architecture modification that stabilizes training for models that reuse Transformer blocks up to 12 times and improves average downstream-task performance by up to 13.2 percent in settings where standard looped models already train stably.
Fully Looped Transformer is a training-stability technique that lets models reuse the same Transformer blocks for up to 12 loop iterations without collapsing, according to a preprint posted to arXiv on May 20, 2026. The approach addresses a longstanding problem with Looped Transformers, architectures that trade compute for performance by iteratively applying the same weights, which have historically become unstable in training as loop counts rise. The paper identifies two root causes: gradient oscillation and residual explosion.
Looped Transformers offer a compelling alternative to scaling by parameter count. Instead of building bigger models, they run the same Transformer blocks multiple times, trading additional computation for improved performance without increasing the parameter budget or context length. Because the number of loop iterations can be adjusted at inference, they also provide a natural mechanism for balancing performance against test-time compute. That flexibility makes them attractive for practitioners running models locally, where memory constraints often matter more than raw FLOPs. The catch has been training stability: push the loop count too high, and the model collapses.
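The weight-reuse idea can be illustrated with a minimal sketch. This is not the paper's code: the toy residual block below stands in for a real Transformer block, and the names `make_block`, `apply_block`, `looped_forward`, and `param_count` are hypothetical helpers invented for illustration. It shows only that one set of weights can be applied a configurable number of times, so the loop count changes compute but not the parameter count.

```python
import math
import random


def make_block(dim, seed=0):
    """Create the weights of one toy block (a stand-in for a Transformer block)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_block(weights, x):
    """One residual update: x + tanh(W x)."""
    return [
        xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for xi, row in zip(x, weights)
    ]


def looped_forward(weights, x, n_loops):
    """Apply the same block n_loops times; no new parameters are created."""
    for _ in range(n_loops):
        x = apply_block(weights, x)
    return x


def param_count(weights):
    """Number of scalar weights in the block."""
    return sum(len(row) for row in weights)


weights = make_block(dim=4)
x0 = [0.5, -0.25, 0.1, 0.0]

# Same weights, different compute budgets chosen at inference time.
cheap = looped_forward(weights, x0, n_loops=2)
full = looped_forward(weights, x0, n_loops=12)

print("parameters:", param_count(weights))
print("2 loops: ", cheap)
print("12 loops:", full)
```

Running more loops costs proportionally more computation while `param_count` stays fixed, which is the trade-off described above. The sketch does not model training, so it says nothing about the instability that appears at high loop counts.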
The fix introduces two parameter-free modifications. Fully Looped Architecture distributes inter-loop signals across all layers to prevent residual values from exploding. Attention Injection reuses the existing attention block to suppress gradient oscillation. Together, these changes let the model train stably at 12 iterations, a regime where baseline looped models fail. In milder settings where standard Looped Transformers remain stable, Fully Looped Transformer still improves average downstream-task performance by up to 13.2 percent. A single set of weights can be run with fewer loop iterations on constrained hardware or pushed to higher iteration counts when more compute is available, mirroring the broader trend in open-weight models toward runtime configurability.
