RaPO tackles catastrophic forgetting in visual continual learning
A new reinforcement fine-tuning method addresses forgetting in class-incremental and domain-incremental visual learning by rewarding rollouts that preserve knowledge from prior tasks.

Retention-aware Policy Optimization (RaPO), a reinforcement fine-tuning method from researchers at the University of Hong Kong and Peking University, reduces catastrophic forgetting in visual continual learning. Released this week as a preprint, RaPO outperforms standard supervised fine-tuning and existing reinforcement approaches across five visual continual learning benchmarks, including class-incremental and domain-incremental settings on multimodal large language models.
The core problem RaPO targets is trajectory-level drift agnosticism: when several candidate rollouts earn the same task reward during reinforcement fine-tuning, nothing in that reward favors the ones that stay close to what the model already knows, so the policy can drift arbitrarily far from the prior task's distribution and erase earlier knowledge. RaPO adds a retention reward that penalizes KL divergence from the preceding-task policy at the trajectory level, preferentially reinforcing rollouts that preserve knowledge. A second component, cross-task advantage normalization, maintains an exponential moving average of reward statistics across task boundaries to stabilize optimization as the model encounters new tasks sequentially.
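The two components described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function and class names, the penalty weight `beta`, the EMA `momentum`, and the sampled-token KL estimate are all assumptions made for illustration, since the exact formulation is not given here.

```python
import math
from dataclasses import dataclass
from typing import List


def trajectory_kl(logp_current: List[float], logp_previous: List[float]) -> float:
    """Single-sample estimate of trajectory-level KL from the previous-task policy.

    Sums log pi_current(token) - log pi_previous(token) over the sampled tokens
    of one rollout. (Assumed estimator; the preprint may use a different one.)
    """
    return sum(c - p for c, p in zip(logp_current, logp_previous))


def retention_shaped_rewards(task_rewards: List[float],
                             kls: List[float],
                             beta: float = 0.1) -> List[float]:
    """Task reward minus a KL penalty, so low-drift rollouts score higher.

    `beta` is a hypothetical penalty weight.
    """
    return [r - beta * kl for r, kl in zip(task_rewards, kls)]


@dataclass
class CrossTaskNormalizer:
    """Running reward statistics that persist across task boundaries."""
    momentum: float = 0.9  # hypothetical EMA coefficient
    mean: float = 0.0
    var: float = 1.0
    initialized: bool = False

    def update(self, rewards: List[float]) -> None:
        batch_mean = sum(rewards) / len(rewards)
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
        if not self.initialized:
            # Seed the running statistics from the first batch.
            self.mean, self.var, self.initialized = batch_mean, batch_var, True
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * batch_mean
            self.var = m * self.var + (1 - m) * batch_var

    def advantages(self, rewards: List[float], eps: float = 1e-6) -> List[float]:
        # Normalize with the running statistics, not the current batch alone.
        std = math.sqrt(self.var) + eps
        return [(r - self.mean) / std for r in rewards]


# Two rollouts with the same task reward but different drift.
kls = [
    trajectory_kl([-0.1, -0.2], [-0.1, -0.3]),  # stays close to the old policy
    trajectory_kl([-0.1, -0.2], [-2.0, -1.5]),  # drifts far from it
]
shaped = retention_shaped_rewards([1.0, 1.0], kls)
normalizer = CrossTaskNormalizer()
normalizer.update(shaped)
advs = normalizer.advantages(shaped)
```

In this toy case both rollouts earn a task reward of 1.0, but the low-drift rollout ends up with the higher shaped reward and a positive advantage, which is the tie-breaking behavior the retention reward is meant to provide. Keeping one `CrossTaskNormalizer` instance alive across tasks is what carries the reward statistics over task boundaries.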
Evaluation spans class-incremental learning (new object categories arriving in sequence), domain-incremental learning (visual domains shifting), and three additional continual learning scenarios. Compared with both supervised fine-tuning and prior reinforcement methods such as GRPO, RaPO consistently reduces forgetting while maintaining plasticity, the model's ability to learn new tasks. This marks the first systematic study of reinforcement fine-tuning in visual continual learning, a notable gap given recent claims that reinforcement approaches inherently resist forgetting better than supervised training.
Key unknowns: whether RaPO scales to longer task sequences and higher-resolution inputs without accumulated drift overwhelming the retention reward, whether trajectory-level reward shaping generalizes beyond vision, and whether it could stabilize policy-gradient methods other than the GRPO baseline tested here.