Alibaba's RoTS-32B tops OSWorld with 800k error-recovery trajectories
Alibaba researchers released GUI-RobustEval, a 1,216-case benchmark for GUI agent error recovery, plus RoTS-7B and RoTS-32B, two models trained on 800k synthesized recovery steps. The 32B model now reports the top score on OSWorld.
GUI agents that can't recover from their own mistakes rarely survive contact with real users, a problem Alibaba researchers now claim to have addressed at scale. Their new RoTS-32B model, fine-tuned on 800,000 synthesized error-recovery trajectories, reaches a 47.4 percent success rate on OSWorld, which the researchers report as state of the art for long-horizon desktop automation tasks, along with a 33.8 percent All-Pass@4 score.
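The two headline numbers measure different things, and a small sketch may help show why the second is so much lower. It assumes that All-Pass@4 counts a task as solved only when all four independent runs succeed, and that the success rate is the mean over individual runs; neither definition is spelled out here, and the task names and results below are invented for illustration.

```python
from typing import Dict, List


def success_rate(results: Dict[str, List[bool]]) -> float:
    """Mean success over every individual run (assumed definition)."""
    runs = [ok for task_runs in results.values() for ok in task_runs]
    return sum(runs) / len(runs)


def all_pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k runs all succeed (assumed definition)."""
    for task, task_runs in results.items():
        if len(task_runs) < k:
            raise ValueError(f"task {task!r} has fewer than {k} runs")
    passed = sum(all(task_runs[:k]) for task_runs in results.values())
    return passed / len(results)


# Hypothetical results: four tasks, four runs each.
results = {
    "open_settings": [True, True, True, True],
    "rename_file": [True, False, True, True],
    "export_pdf": [False, False, False, False],
    "send_email": [True, True, True, True],
}

print(success_rate(results))      # 0.6875
print(all_pass_at_k(results, 4))  # 0.5
```

Under these assumptions, a single failed run is enough to drop a task from the All-Pass count, so the metric rewards consistency rather than occasional success.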
The work introduces two contributions. GUI-RobustEval is a benchmark of 1,216 executable test cases designed to measure how well agents recover from realistic errors such as misclicks, wrong menu choices, and navigation loops. Robustness-driven Trajectory Synthesis (RoTS) is the data pipeline behind the models: a tree-based framework that proactively discovers diverse error modes in GUI workflows and synthesizes corresponding recovery steps at scale. The resulting 800k-example dataset is used to train both a 7B and a 32B parameter model, each of which shows gains on GUI-RobustEval and on traditional GUI benchmarks.
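The general idea of branching a workflow into error-and-recovery variants can be shown with a toy sketch. This is not the RoTS pipeline itself, whose details are not given here: the `perturb` and `recover` functions, the action strings, and the labels are all hypothetical. Each synthesized trajectory follows a reference path, branches into one wrong action, recovers, and then resumes the reference path; because the variants share prefixes with the reference, together they form a tree.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action, label), label is "correct", "error", or "recovery"


def synthesize(
    reference: List[str],
    perturb: Callable[[str], List[str]],
    recover: Callable[[str], str],
) -> List[List[Step]]:
    """Branch the reference path at every step into error + recovery variants."""
    trajectories: List[List[Step]] = []
    for i, action in enumerate(reference):
        for wrong in perturb(action):
            steps: List[Step] = [(a, "correct") for a in reference[:i]]
            steps.append((wrong, "error"))
            steps.append((recover(wrong), "recovery"))
            # After recovering, the agent retries the intended action and continues.
            steps.extend((a, "correct") for a in reference[i:])
            trajectories.append(steps)
    return trajectories


# Hypothetical error model: only clicks can go wrong, and Escape undoes them.
def perturb(action: str) -> List[str]:
    return [action + " (misclick)"] if action.startswith("click:") else []


def recover(wrong_action: str) -> str:
    return "press:Escape"


reference = ["click:File", "click:Save As", "type:report.pdf"]
for trajectory in synthesize(reference, perturb, recover):
    print(trajectory)
```

A real pipeline would need to execute the actions in a live environment to confirm that each recovery step actually restores a usable state; the sketch only shows the branching structure.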
Existing GUI agents often fail silently or compound their mistakes because training data rarely includes realistic recovery paths. The 32B variant's OSWorld performance suggests that teaching agents to recover from errors in long-horizon tasks improves not just robustness but overall task completion. Code, benchmark, and model weights are available on GitHub.



