Fast-Slow Training curbs LLM catastrophic forgetting, cutting drift from the base model by 70% while tripling sample efficiency
Berkeley researchers propose a dual-weight training framework that treats model parameters as 'slow' weights and optimized context as 'fast' weights, preserving plasticity and general reasoning while adapting to new tasks.

Fast-Slow Training (FST) is a continual learning framework from UC Berkeley and Databricks that splits LLM adaptation into two timescales: model parameters as "slow" weights that preserve general reasoning, and optimized context as "fast" weights that absorb task-specific information from textual feedback. The approach, detailed in a preprint released May 13, 2026, by Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal, Joseph E. Gonzalez, Matei Zaharia, and Kurt Keutzer, achieves up to 3× better sample efficiency than parameter-only reinforcement learning on reasoning tasks while exhibiting 70% lower KL divergence from the base model.
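The KL figure measures how far the adapted model's next-token distributions drift from the base checkpoint. The preprint's exact evaluation protocol is not described here; the following is a minimal sketch of the underlying quantity, with the logit values purely hypothetical:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical next-token logits from the base and adapted models.
# Small logit drift yields small KL, i.e. behavior close to the base.
base    = softmax([2.0, 1.0, 0.5, 0.1])
adapted = softmax([1.9, 1.1, 0.5, 0.2])

print(f"KL(base || adapted) = {kl_divergence(base, adapted):.4f}")
```

In practice this would be averaged over tokens and prompts; lower values indicate the fine-tuned model still assigns probabilities much like the pretrained one.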
Traditional RL fine-tuning forces all task knowledge into model weights, which leads to catastrophic forgetting when domains shift and a gradual loss of plasticity—the model's ability to learn subsequent tasks effectively. In-context learning avoids weight updates but typically underperforms tuned models. FST combines both: the slow weights drift minimally from the base checkpoint, preserving broad capabilities, while the fast weights—optimized prompts or context vectors—adapt rapidly to the current task without baking in domain-specific biases. In continual learning scenarios where task distributions change on the fly, FST-trained models continue acquiring each new task while parameter-only RL stalls.
The preprint reports that FST models not only reach higher performance asymptotes than RL-only training but also transfer more successfully to a second task after initial training, a direct measure of preserved plasticity. The reduced parameter drift means the model stays closer to its pretrained reasoning behaviors, which matters for deployment: a model that forgets less can be retasked without full retraining. The authors frame the dual-weight design as analogous to human System 1 and System 2 thinking—fast pattern matching layered over slower deliberative processes.
Code and trained checkpoints have not yet been released. Practitioners should watch for a public implementation and ablation studies on larger models to see whether the 3× efficiency gain holds at 70B+ parameter scales and whether the approach generalizes beyond reasoning tasks to multimodal or long-context adaptation.