ProCUA-SFT dataset pushes UI-TARS 7B to 45% OSWorld success with 3.1M synthetic step-level samples
A new preprint details ProCUA-SFT, a 3.1M-sample dataset built from synthetic desktop-agent trajectories that lifts UI-TARS 7B's OSWorld score 18.7 percentage points above baseline, reversing the negative transfer seen with human-collected data.

ProCUA-SFT is a supervised fine-tuning dataset of 3.1 million step-level samples distilled from 93,000 synthetic desktop-agent trajectories, described in a technical report released on arXiv this week. The dataset targets computer-use agents, models that navigate graphical desktops via screenshots and keyboard/mouse actions, and was built to address a surprising failure mode: continued training of UI-TARS 7B on AgentNet (22,500 trajectories), the largest public human-trajectory resource, causes its OSWorld success rate to collapse from 26.3 percent to 8–10 percent. Fine-tuning the same 7B model on ProCUA-SFT for a single epoch instead reaches 45.0 percent on OSWorld, an 18.7 percentage-point gain over the base checkpoint and at least 35 points above the AgentNet-trained variants.
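For readers who want to check the headline numbers, the short Python sketch below recomputes them from the figures quoted above. The variable names are illustrative, and the samples-per-trajectory value is only an average derived from the rounded 3.1 million total, not a number stated in the report.

```python
# Figures as quoted in the article (OSWorld success rates in percent).
baseline = 26.3                 # UI-TARS 7B base checkpoint
procua_sft = 45.0               # after one epoch on ProCUA-SFT
agentnet_low, agentnet_high = 8.0, 10.0  # range after AgentNet training

# Gain over the base checkpoint, in percentage points.
gain = round(procua_sft - baseline, 1)

# Margin over the AgentNet-trained variants (smallest and largest gap).
margin_low = round(procua_sft - agentnet_high, 1)
margin_high = round(procua_sft - agentnet_low, 1)

# Average step-level samples per trajectory (approximate: 3.1M is rounded).
samples_per_trajectory = 3_100_000 / 93_000

print(f"gain over baseline: {gain} points")
print(f"margin over AgentNet variants: {margin_low} to {margin_high} points")
print(f"average samples per trajectory: about {samples_per_trajectory:.1f}")
```

The average of roughly 33 samples per trajectory is consistent with each trajectory contributing about one sample per step.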
The pipeline that produces ProCUA-SFT runs entirely on live desktop environments seeded with real-world artifacts: 912 spreadsheets from SpreadsheetBench, roughly 10,000 permissively licensed presentations from Zenodo10K, and multi-application configurations drawn from OSWorld. A single vision-language model, Kimi-K2.5, acts as goal generator, precondition judge, and trajectory executor in one pass, eliminating the capability mismatch that arises when a planner and an actor are separate models. Each task is verified for feasibility through binary precondition checks before rollout, and every resulting trajectory is expanded into step-prefix samples that mirror the exact context layout an agent sees at inference time. The 93,000 trajectories span 2,484 distinct application combinations.
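The step-prefix expansion described above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Step` and `Sample` types, the `expand_to_step_prefixes` function, and the optional `max_history` cap are all hypothetical names introduced here, and the actual sample schema used for ProCUA-SFT may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    screenshot: str  # identifier of the screenshot seen before acting (hypothetical field)
    action: str      # action taken at this step, e.g. "click(120, 340)"


@dataclass
class Sample:
    goal: str             # natural-language task description
    history: List[Step]   # steps already executed (the prefix)
    screenshot: str       # current observation
    target_action: str    # action the model is trained to predict


def expand_to_step_prefixes(
    goal: str, steps: List[Step], max_history: Optional[int] = None
) -> List[Sample]:
    """Turn one trajectory into one training sample per step.

    Sample i contains the goal, the steps before step i as history, the
    screenshot at step i, and the action at step i as the target.
    max_history (an assumption, not from the report) caps the history length.
    """
    samples = []
    for i, step in enumerate(steps):
        start = 0 if max_history is None else max(0, i - max_history)
        samples.append(
            Sample(
                goal=goal,
                history=steps[start:i],
                screenshot=step.screenshot,
                target_action=step.action,
            )
        )
    return samples


# Usage: a three-step trajectory yields three samples with growing prefixes.
trajectory = [
    Step("shot_0.png", "click(40, 12)"),
    Step("shot_1.png", "type('Q3 totals')"),
    Step("shot_2.png", "press('enter')"),
]
samples = expand_to_step_prefixes("Rename the sheet to 'Q3 totals'", trajectory)
capped = expand_to_step_prefixes("Rename the sheet to 'Q3 totals'", trajectory, max_history=1)
```

Under this scheme a trajectory with n steps produces n samples, which is how 93,000 trajectories can expand into millions of step-level training examples.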
A subset of ProCUA has already been folded into the training mix for Nemotron 3 Nano Omni, contributing to that model's computer-use capabilities. The authors note that negative transfer from human data remains an open puzzle: AgentNet, despite being the largest public human-trajectory resource, degrades the model under fine-tuning, while the synthetic data delivers consistent gains. Whether the gap stems from annotation noise, task distribution mismatch, or something deeper in the human-demonstration signal is still unclear. The next questions are whether ProCUA's automated synthesis approach scales to longer-horizon tasks, and whether other base models show the same dramatic lift or UI-TARS 7B's architecture happens to be particularly receptive to step-prefix formatting.



