VideoSeeker outperforms GPT-4o on instance-level video tasks with visual prompts
A new vision-language model uses visual prompts and reinforcement learning to outperform GPT-4o and Gemini-2.5-Pro on spatiotemporal video understanding benchmarks.

VideoSeeker, a vision-language model from researchers led by Yiming Zhao, treats instance-level video understanding as an agentic tool-calling problem. Text prompts struggle to pinpoint exact frames or objects, so instead of relying on them to specify spatial and temporal regions, the model accepts visual prompts and autonomously retrieves relevant video segments on demand. It internalizes tool-calling and proactive perception through a four-stage automated data synthesis pipeline, cold-start supervision, and reinforcement learning. On instance-level video understanding benchmarks, VideoSeeker achieves an average improvement of 13.7 percent over baseline methods and surpasses closed-source models including GPT-4o and Gemini-2.5-Pro.
Large vision-language models have made steady progress on general video question-answering, but tasks requiring precise spatiotemporal localization at the instance level remain a weak point. Existing methods center reasoning around language tokens, which limits the model's ability to proactively perceive fine-grained visual evidence. VideoSeeker addresses this by integrating agentic reasoning directly into the perception loop: the model can invoke tools to retrieve specific segments when it needs more information, rather than waiting for a human to craft a more detailed text prompt. The visual prompt interface also lets practitioners point directly at regions of interest in a video frame instead of describing them in words. The team reports that the model transfers well to general video understanding tasks, suggesting the agentic approach does not sacrifice broader capability. The preprint was posted May 19, 2026, and the authors plan to release datasets and code publicly.
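
For readers who want a concrete picture of the perception loop described above, the following is a minimal Python sketch of a tool-calling loop driven by a visual prompt. It is not the authors' code, which has not been released. Every name here (`VisualPrompt`, `ToolCall`, `Answer`, `retrieve_segment`, `run_agent`, `toy_policy`) is hypothetical, frames are represented as plain strings, and a hard-coded policy stands in for the trained model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass(frozen=True)
class VisualPrompt:
    """A user-drawn box on one frame (hypothetical format, not the paper's)."""
    frame_index: int
    box: Tuple[int, int, int, int]  # x1, y1, x2, y2 in pixels


@dataclass(frozen=True)
class ToolCall:
    """A request for more frames: start inclusive, end exclusive."""
    start: int
    end: int


@dataclass(frozen=True)
class Answer:
    text: str


Action = Union[ToolCall, Answer]
Policy = Callable[[str, VisualPrompt, List[str]], Action]


def retrieve_segment(video: List[str], call: ToolCall) -> List[str]:
    """Toy retrieval tool: return the requested frames, clamped to the video."""
    start = max(0, call.start)
    end = min(len(video), call.end)
    return video[start:end]


def run_agent(video: List[str], prompt: VisualPrompt, question: str,
              policy: Policy, max_steps: int = 4) -> str:
    """Alternate between policy decisions and tool calls until an answer."""
    # Start with only the frame the user pointed at.
    evidence = [video[prompt.frame_index]]
    for _ in range(max_steps):
        action = policy(question, prompt, evidence)
        if isinstance(action, Answer):
            return action.text
        # The policy asked for more visual evidence, so fetch it.
        evidence.extend(retrieve_segment(video, action))
    return "no answer within step budget"


def toy_policy(question: str, prompt: VisualPrompt,
               evidence: List[str]) -> Action:
    """Stand-in for the model: request three more frames once, then answer."""
    if len(evidence) == 1:
        return ToolCall(prompt.frame_index + 1, prompt.frame_index + 4)
    return Answer(f"answered using {len(evidence)} frames")


if __name__ == "__main__":
    video = [f"frame{i}" for i in range(10)]
    prompt = VisualPrompt(frame_index=2, box=(10, 10, 50, 50))
    print(run_agent(video, prompt, "What does the marked object do next?",
                    toy_policy))
```

In a real system the policy would be the vision-language model itself, and the decision to call the tool or answer would be learned rather than hard-coded. The sketch shows only the control flow: the model, not the user, decides when more frames are needed.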