ProCUA-SFT dataset pushes UI-TARS 7B to 45% OSWorld success with 3.1M synthetic step-level samples

A new preprint details ProCUA-SFT, a dataset of 3.1M step-level samples drawn from synthetic desktop-agent trajectories that lifts UI-TARS 7B's OSWorld score 18.7 percentage points above baseline, in contrast to the negative transfer seen with human-collected data.

By Alex Sokoloff · June 17, 2026


ProCUA-SFT is a supervised fine-tuning dataset of 3.1 million step-level samples distilled from 93,000 synthetic desktop-agent trajectories, released this week in a technical report on arXiv. The dataset targets computer-use agents (models that navigate graphical desktops via screenshots and keyboard/mouse actions) and was built to address a surprising failure mode: the largest public human-trajectory resource, AgentNet (22,500 trajectories), causes OSWorld success rates to collapse from 26.3 percent to 8–10 percent when used for continued training of UI-TARS 7B. Fine-tuning the same 7B model on ProCUA-SFT for a single epoch instead reaches 45.0 percent on OSWorld, an 18.7 percentage-point gain over the base checkpoint and 35 to 37 points above the AgentNet-trained variants.
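The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the constant and function names are illustrative, and the samples-per-trajectory figure is an approximation because the 3.1 million sample count is itself rounded.

```python
# Figures as quoted in the article (all success rates in percent).
BASE_SUCCESS = 26.3            # UI-TARS 7B base checkpoint on OSWorld
PROCUA_SUCCESS = 45.0          # after one epoch on ProCUA-SFT
AGENTNET_SUCCESS = (8.0, 10.0) # reported range after AgentNet training
NUM_SAMPLES = 3_100_000        # step-level samples (rounded figure)
NUM_TRAJECTORIES = 93_000      # synthetic trajectories


def summarize() -> dict:
    """Derive the percentage-point gaps and the average samples per trajectory."""
    return {
        "gain_over_base": round(PROCUA_SUCCESS - BASE_SUCCESS, 1),
        "margin_over_agentnet": tuple(
            round(PROCUA_SUCCESS - s, 1) for s in AGENTNET_SUCCESS
        ),
        "avg_samples_per_trajectory": round(NUM_SAMPLES / NUM_TRAJECTORIES, 1),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

The average of roughly 33 samples per trajectory is consistent with each trajectory contributing about one sample per step, though the report's exact expansion rule is not given here.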

The pipeline that produces ProCUA-SFT runs entirely on live desktop environments seeded with real-world artifacts: 912 spreadsheets from SpreadsheetBench, roughly 10,000 permissively licensed presentations from Zenodo10K, and multi-application configurations drawn from OSWorld. A single vision-language model, Kimi-K2.5, acts as goal generator, precondition judge, and trajectory executor in one pass, eliminating the capability mismatch that arises when a planner and an actor are separate models. Each task is verified for feasibility through binary precondition checks before rollout, and every resulting trajectory is expanded into step-prefix samples that mirror the exact context layout an agent sees at inference time. The 93,000 trajectories span 2,484 distinct application combinations.
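To make the step-prefix idea concrete, here is a minimal sketch of how one trajectory could be expanded into per-step training samples. It assumes each sample pairs the goal, the steps taken so far, and the current screenshot with the next action as the target. The `Step` and `Sample` types, their field names, and `expand_step_prefixes` are hypothetical and are not taken from the ProCUA report, whose actual context layout may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    screenshot: str  # hypothetical: ID or path of the screenshot seen before acting
    action: str      # hypothetical: serialized keyboard/mouse action


@dataclass(frozen=True)
class Sample:
    goal: str
    history: Tuple[Step, ...]  # all steps strictly before the current one
    screenshot: str            # what the agent observes at this step
    target_action: str         # the action the model is trained to predict


def expand_step_prefixes(goal: str, trajectory: List[Step]) -> List[Sample]:
    """Turn an n-step trajectory into n samples, one per step prefix."""
    samples = []
    for i, step in enumerate(trajectory):
        samples.append(
            Sample(
                goal=goal,
                history=tuple(trajectory[:i]),
                screenshot=step.screenshot,
                target_action=step.action,
            )
        )
    return samples


if __name__ == "__main__":
    demo = [
        Step("shot_0.png", "click(120, 48)"),
        Step("shot_1.png", "type('Q3 totals')"),
        Step("shot_2.png", "press('enter')"),
    ]
    for s in expand_step_prefixes("Rename the sheet to 'Q3 totals'", demo):
        print(len(s.history), s.screenshot, "->", s.target_action)
```

Under this scheme a trajectory of n steps yields n samples, so training sees every intermediate state with the same context an agent would have at that point during inference.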

A subset of ProCUA has already been folded into the training mix for Nemotron 3 Nano Omni, contributing to that model's computer-use capabilities. The authors note that negative transfer from human data remains an open puzzle: AgentNet, the largest public human-trajectory resource, degrades UI-TARS 7B under fine-tuning, while the synthetic data delivers consistent gains. Whether the gap stems from annotation noise, task distribution mismatch, or something deeper in the human-demonstration signal is still unclear. The open questions are whether ProCUA's automated synthesis approach scales to longer-horizon tasks, and whether other base models show the same dramatic lift or UI-TARS 7B's architecture happens to be particularly receptive to step-prefix formatting.
