Qwen3-VL-8B reaches 39% on multimodal search with on-policy data evolution
A new training framework from Alibaba and collaborators lets visual-native agents reuse intermediate image outputs across search and tool chains, lifting Qwen3-VL-8B from 24.9% to 39.0% average accuracy across eight benchmarks and past Gemini-2.5 Pro's 37.9%.

On-Policy Data Evolution (ODE) is a training framework that closes the loop between a multimodal agent's rollouts and the data used to train it. In a preprint released May 13, Alibaba researchers show that Qwen3-VL-8B climbs from 24.9% to 39.0% average accuracy across eight multimodal deep search benchmarks when trained with ODE, surpassing Gemini-2.5 Pro's 37.9% in standard agent workflows. At the 30B scale, the same recipe lifts Qwen3-VL's baseline from 30.6% to 41.5%.
The framework introduces an image bank reference protocol that treats every tool-returned image—whether from web search, a screenshot, or a transformation—as an addressable artifact that later tools can re-consume. Existing systems discard intermediate visual outputs, forcing agents to re-request or re-generate evidence. By registering each image with a persistent reference, ODE lets the agent chain visual reasoning steps without losing context.

The training data generator then runs in rounds: it collects rollouts from the current policy, refines its curation recipe based on what the agent still struggles with, and produces the next batch of supervised fine-tuning traces or reinforcement learning tasks. The same closed-loop process supports both SFT and RL phases, so the data evolves alongside the model's capability.
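The paper does not publish the protocol's interface, but the core idea—register every tool-returned image under a persistent reference that later tool calls can cite—can be sketched roughly as follows. The `ImageBank` class, its method names, and the `img_N` reference scheme are illustrative assumptions, not the authors' API:

```python
from dataclasses import dataclass, field

@dataclass
class ImageBank:
    """Hypothetical registry assigning a persistent reference to every
    tool-returned image, so downstream tools can re-consume it by id."""
    _store: dict = field(default_factory=dict)
    _counter: int = 0

    def register(self, image_bytes: bytes, source: str) -> str:
        """Store an image and return a stable reference later tools can cite."""
        ref = f"img_{self._counter}"
        self._counter += 1
        self._store[ref] = {"data": image_bytes, "source": source}
        return ref

    def fetch(self, ref: str) -> bytes:
        """Resolve a reference back to image bytes for a downstream tool call."""
        return self._store[ref]["data"]

# A search tool registers its screenshot; a crop or annotation tool
# later re-consumes the same evidence by reference instead of re-fetching it.
bank = ImageBank()
ref = bank.register(b"\x89PNG...", source="web_search")
assert bank.fetch(ref) == b"\x89PNG..."
```

The key design point is that the reference, not the pixels, travels through the agent's context, which is what lets visual reasoning steps chain without re-requesting evidence.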
Image-bank reuse delivers the largest gains on tasks requiring iterative visual refinement—scenarios where an agent must compare screenshots, overlay annotations, or re-query a transformed image. Rollout-feedback evolution produces more grounded SFT traces and better policy-matched RL tasks than static synthesis recipes, which the authors attribute to the per-round targeting of gaps in the current checkpoint. The Qwen3-VL-8B agent trained with ODE beats Gemini-2.5 Pro on the standard agent-workflow setting, though the paper does not report whether Google's model was fine-tuned on similar data or evaluated under the same image-bank protocol.
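The rollout-feedback loop described above can be sketched in a few lines. The interfaces here (`collect`, `curate`, `train`) are placeholders for the paper's rollout collector, data curator, and SFT/RL update—a minimal sketch of the control flow, not the authors' implementation:

```python
from typing import Any, Callable, List

Rollout = dict  # e.g. {"task": ..., "trace": ..., "success": bool}

def evolve_data(policy: Any,
                collect: Callable[[Any], List[Rollout]],
                curate: Callable[[List[Rollout]], List[Rollout]],
                train: Callable[[Any, List[Rollout]], Any],
                rounds: int = 3) -> Any:
    """Closed loop: each round's curation recipe is conditioned on what
    the *current* checkpoint still gets wrong, so the data evolves with
    the model rather than coming from a static synthesis recipe."""
    for _ in range(rounds):
        rollouts = collect(policy)                      # on-policy rollouts
        failures = [r for r in rollouts if not r["success"]]
        traces = curate(failures)                       # targeted SFT traces / RL tasks
        policy = train(policy, traces)                  # next checkpoint
    return policy
```

In a toy instantiation where the policy is a scalar skill level and training on failures raises it, the loop converges once the checkpoint covers all tasks—the same dynamic the authors credit for outperforming static recipes.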
What remains open is how the framework scales beyond the eight benchmarks tested and whether the image-bank protocol can be adopted by closed-API agents that do not expose intermediate tool outputs. The next milestone to watch is whether Alibaba releases the harness as a standalone library and whether competing labs publish their own closed-loop data recipes targeting visual-native reasoning.