Video2GUI mines 12M GUI trajectories from tutorial videos for agent training
Researchers have introduced Video2GUI, an automated pipeline that extracts structured GUI interaction data from unlabeled tutorial videos. Its output is WildGUI, a 12-million-trajectory dataset spanning more than 1,500 applications and websites; the authors plan to release both.

Video2GUI is an automated framework that mines graphical user interface interaction trajectories directly from unlabeled Internet videos. The system addresses a core bottleneck in GUI agent research: the scarcity of large-scale training data covering diverse real-world applications. Traditional GUI datasets rely on expensive manual annotation and typically cover narrow domains; Video2GUI sidesteps both constraints by processing raw tutorial videos at scale.
The pipeline uses a coarse-to-fine filtering strategy to identify high-quality GUI tutorial content and convert it into structured agent trajectories: sequences of grounded actions that an AI agent can learn from. Applying the pipeline to 500 million video metadata entries, the authors constructed WildGUI, a dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. That scale and diversity represent a step change from prior GUI datasets, which rarely exceed tens of thousands of annotated examples.
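To make the coarse-to-fine idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary above does not describe the actual filter stages, so the `VideoMeta` fields, the keyword list, the duration bounds, and the `fine_score` callback are all hypothetical stand-ins. The only point it illustrates is the ordering: a cheap metadata check discards most candidates before a more expensive scorer runs on the survivors.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class VideoMeta:
    """Hypothetical metadata record; field names are assumptions."""
    video_id: str
    title: str
    duration_s: int


# Hypothetical title hints for the coarse stage (not taken from the paper).
TUTORIAL_HINTS = ("how to", "tutorial", "walkthrough", "step by step")


def coarse_filter(meta: VideoMeta) -> bool:
    """Cheap, metadata-only check: a tutorial-like title and a plausible length."""
    title = meta.title.lower()
    has_hint = any(hint in title for hint in TUTORIAL_HINTS)
    return has_hint and 30 <= meta.duration_s <= 3600


def coarse_to_fine(
    videos: Iterable[VideoMeta],
    fine_score: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> Iterator[VideoMeta]:
    """Yield videos that pass the coarse check and then the fine scorer.

    `fine_score` stands in for any expensive model (for example, a classifier
    over sampled frames); it is only called on coarse-stage survivors.
    """
    for meta in videos:
        if not coarse_filter(meta):
            continue
        if fine_score(meta) >= threshold:
            yield meta


def stub_fine_score(meta: VideoMeta) -> float:
    """Toy scorer used only for this demo: favors titles mentioning Excel."""
    return 0.9 if "excel" in meta.title.lower() else 0.1


sample = [
    VideoMeta("a1", "How to build a pivot table in Excel", 420),
    VideoMeta("b2", "Funny cat compilation", 300),
    VideoMeta("c3", "Tutorial: my morning routine", 600),
    VideoMeta("d4", "Excel tutorial (full course)", 40000),
]

kept = [m.video_id for m in coarse_to_fine(sample, stub_fine_score)]
print(kept)
```

In this toy run, `b2` and `d4` are rejected by the metadata check alone, so the stub scorer is only evaluated on `a1` and `c3`. At the scale of hundreds of millions of metadata entries, that ordering is what keeps the expensive stage affordable.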
Benchmark results
Pretraining Qwen2.5-VL and MiMo-VL on WildGUI delivered consistent 5–20% improvements across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art results. The gains held across both models, suggesting that the dataset itself, rather than model architecture, drives the performance lift.
The authors plan to release both the WildGUI dataset and the Video2GUI pipeline, enabling other research groups to replicate the extraction process on their own video corpora or extend the dataset to additional domains.