OpenWebRL-4B reaches 67% success on live-web tasks with minimal supervised data

New open framework trains visual web agents via online reinforcement learning on real websites, achieving state-of-the-art open results with just 400 initialization examples.

By Alex Sokoloff · June 2, 2026

Researchers say online reinforcement learning can break the expensive demonstration bottleneck that has held back open-source visual web agents. OpenWebRL, a new framework detailed in a preprint this week, trains agents directly on live websites using multi-turn RL, achieving state-of-the-art open results with just 0.4K initialization trajectories and 2.2K RL training tasks.

OpenWebRL-4B, the 4-billion-parameter model trained under the framework, scored 67.0% success on Online-Mind2Web and 64.0% on DeepShop, two live-web benchmarks that test long-horizon reasoning and interaction with dynamic sites. Those numbers exceed prior open agents at similar or larger scale and remain competitive with OpenAI's CUA and Gemini CUA, both closed systems. The framework covers the full training pipeline: scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization.
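To make "trajectory-level success judging" feeding "multi-turn policy optimization" more concrete, here is a minimal Python sketch of one common way such a signal can be used: a single pass/fail verdict per trajectory is normalized within a group of attempts at the same task, and the resulting advantage is shared by every turn. This is an illustration only, not OpenWebRL's published method. The `Trajectory` structure, the `turn_advantages` helper, and the group-relative normalization are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Trajectory:
    """One multi-turn attempt at a web task (hypothetical structure)."""
    task_id: str
    actions: List[str]  # one entry per turn, e.g. "click(#search)"
    success: bool       # verdict from a trajectory-level judge (assumed binary)


def turn_advantages(group: List[Trajectory], eps: float = 1e-6) -> List[List[float]]:
    """Give every turn of a trajectory the same group-normalized advantage.

    Assumption: all trajectories in `group` attempt the same task, so their
    judge verdicts are comparable.
    """
    rewards = [1.0 if t.success else 0.0 for t in group]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; 0.0 when all verdicts agree
    advantages = []
    for traj, reward in zip(group, rewards):
        adv = (reward - mu) / (sigma + eps)
        # The judge scores the whole trajectory, so each turn shares the credit.
        advantages.append([adv] * len(traj.actions))
    return advantages


if __name__ == "__main__":
    group = [
        Trajectory("buy-socks", ["type(search)", "click(result)", "click(buy)"], True),
        Trajectory("buy-socks", ["type(search)", "scroll(down)"], False),
    ]
    for traj, advs in zip(group, turn_advantages(group)):
        print(traj.success, advs)
```

In this sketch, a group in which every attempt succeeds (or every attempt fails) yields zero advantage for all turns, so such groups contribute no learning signal.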

Authors Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao, and Hao Cheng systematically examined which design choices make online RL effective for visual web agents and analyzed how RL improves agentic reasoning beyond supervised fine-tuning. The team plans to release training data, models, and code, offering a practical path toward building more capable, reproducible, and cost-efficient open web agents without the scalability bottleneck of curated demonstration datasets.
