PROWL trains world models on rare failures via adversarial curriculum
Researchers propose a KL-constrained adversarial loop that forces a policy to expose high-error trajectories in a diffusion world model, then fine-tunes the model on those failures—improving robustness on interaction-critical transitions that passive data misses.
PROWL (Prioritized Regret-Driven Optimization for World Model Learning) is a training method from a new arXiv preprint that addresses a stubborn weakness in action-conditioned video world models: they handle common transitions well but fail on rare, high-stakes interactions that matter most for planning and policy performance. The core idea is to train a policy adversarially to find trajectories where the world model makes large prediction errors, then continuously fine-tune the model on those adversarially discovered failures. A KL constraint keeps the adversarial policy close to the behavior distribution, preventing it from drifting into out-of-distribution exploitation that would generate useless training signal.
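To make the trade-off concrete, here is a minimal sketch of a KL-penalized adversarial objective. It is not taken from the preprint. It assumes a discrete action space, a scalar world-model prediction error per trajectory, and a penalty weight `beta`; the function names `kl_divergence` and `adversarial_objective` are hypothetical. The sketch rewards the adversarial policy for eliciting large model errors and charges it for drifting away from the behavior distribution.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete action distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def adversarial_objective(pred_error, policy_probs, behavior_probs, beta=1.0):
    """Score the adversarial policy would maximize (illustrative form only).

    pred_error:     world-model prediction error on the trajectory (assumed scalar)
    policy_probs:   adversarial policy's action distribution
    behavior_probs: action distribution of the behavior data
    beta:           hypothetical weight on the KL penalty
    """
    return pred_error - beta * kl_divergence(policy_probs, behavior_probs)

behavior = [0.5, 0.5]
near = adversarial_objective(2.0, [0.5, 0.5], behavior)   # no drift, no penalty
drift = adversarial_objective(2.0, [0.9, 0.1], behavior)  # same error, penalized
print(near > drift)
```

With a small `beta` the policy is free to chase any error it can find, including out-of-distribution exploits; with a large `beta` it barely moves from the behavior data and finds few new failures.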
The method introduces a Prioritized Adversarial Trajectory buffer that re-ranks collected trajectories by prediction error, action fidelity, and learning progress, ensuring the model focuses on unresolved failure modes rather than repeatedly training on cases it has already learned. The authors implement PROWL in the MineRL framework—a Minecraft-based RL benchmark—and show that models trained with the adversarial curriculum outperform passive-data baselines on held-out out-of-distribution trajectories. The experiments also reveal reward-hacking behaviors when behavioral constraints are too weak, underscoring that effective adversarial training depends on explicit regularization to balance exploratory failure discovery with staying near the data distribution.
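The re-ranking idea can be illustrated with a small sketch. The scoring rule below is an assumption, not the preprint's formula: it treats action fidelity as a gate in [0, 1], measures learning progress as the recent drop in prediction error, and combines them with a hypothetical weight `w_prog`. The `Trajectory` fields and the `priority` and `rerank` functions are likewise invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    name: str
    error: float       # current world-model prediction error
    fidelity: float    # in [0, 1]; how close the actions stay to behavior data
    prev_error: float  # prediction error at the previous evaluation

def priority(t, w_prog=0.5):
    """Assumed score: high error and recent progress, gated by action fidelity."""
    progress = max(0.0, t.prev_error - t.error)  # drop in error since last check
    return t.fidelity * (t.error + w_prog * progress)

def rerank(buffer, k):
    """Return the k highest-priority trajectories, best first."""
    return sorted(buffer, key=priority, reverse=True)[:k]

buffer = [
    Trajectory("already_learned", error=0.05, fidelity=0.9, prev_error=0.06),
    Trajectory("unresolved_failure", error=0.90, fidelity=0.8, prev_error=1.00),
    Trajectory("off_distribution_exploit", error=1.00, fidelity=0.0, prev_error=1.00),
]
for t in rerank(buffer, k=3):
    print(t.name)
```

Under these assumptions, a trajectory the model has already learned drops down the ranking because its error is low, and an off-distribution exploit is zeroed out by the fidelity gate even though its raw error is the highest in the buffer.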
The preprint argues that scalable world models benefit not only from larger passive datasets but also from selectively generating informative training data through adversarial elicitation. By converting rare failures into a stable training signal, PROWL offers a path toward world models that remain reliable in the interaction-critical regimes that matter for downstream planning. The next question is whether the approach scales beyond MineRL to higher-dimensional visual domains and whether the KL-constrained adversarial loop can be tightened further without sacrificing the diversity of discovered failures.
