Omni-DuplexEval benchmark reveals real-time multimodal models fail at streaming interaction
A new benchmark shows how poorly current multimodal models handle streaming inputs and proactive responses: the best system scores under 40% overall.
Omni-DuplexEval is a benchmark for evaluating real-time duplex multimodal interaction: the ability of AI systems to process streaming video and audio while generating responses at appropriate moments. Most multimodal large language models are tested offline, with the entire input available before any response is produced, rather than in real-world conditions where a model must react to a continuous stream. Researchers Chaoqun He, Mingyang Xiang, Yingjing Xu, Bokai Xu, Junbo Cui, and Jie Zhou built the benchmark to fill this gap, assembling 660 videos with human-annotated labels and temporal metadata across nine real-world tasks.
The benchmark splits into two scenarios. Real-Time Description measures whether models generate continuous, time-aligned responses as inputs evolve. Proactive Reminder tests whether models can identify salient events and respond without explicit prompting. All questions are open-ended. The researchers built an automatic evaluation framework using LLM-as-a-Judge that assesses both response content and timing through timestamp-aware reasoning, achieving strong alignment with human judgments.
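To make the idea of scoring both content and timing concrete, here is a minimal Python sketch. It is not the researchers' framework: the `Event` and `Response` types, the `tolerance` window, and the `keyword_judge` stand-in for an LLM judge are all hypothetical, chosen only to illustrate how a response could be credited when it is both on time and on topic.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    start: float      # annotated event start, in seconds (hypothetical schema)
    end: float        # annotated event end, in seconds
    reference: str    # human-written reference description


@dataclass
class Response:
    time: float       # when the model emitted the response, in seconds
    text: str


def score_stream(
    events: List[Event],
    responses: List[Response],
    judge: Callable[[str, str], bool],
    tolerance: float = 2.0,
) -> float:
    """Fraction of annotated events that get a timely response the judge accepts."""
    if not events:
        return 0.0
    used = set()  # each response may be credited to at most one event
    hits = 0
    for ev in events:
        for i, r in enumerate(responses):
            if i in used:
                continue
            # Timely: emitted during the event or within `tolerance` seconds after it.
            timely = ev.start <= r.time <= ev.end + tolerance
            if timely and judge(ev.reference, r.text):
                used.add(i)
                hits += 1
                break
    return hits / len(events)


def keyword_judge(reference: str, text: str) -> bool:
    # Toy stand-in for an LLM judge: accept if the two texts share any word.
    return bool(set(reference.lower().split()) & set(text.lower().split()))


events = [Event(5.0, 8.0, "kettle boiling"), Event(20.0, 22.0, "door opens")]
responses = [
    Response(7.0, "the kettle is boiling"),   # on time and on topic
    Response(30.0, "the door opens"),         # on topic but too late
]
score = score_stream(events, responses, keyword_judge)
```

In this toy run the second response has the right content but arrives outside the window, so only one of the two events is credited. That is the kind of timing failure a content-only metric would miss.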
State-of-the-art duplex multimodal models perform poorly on the benchmark. The best-performing system reaches only 39.6% overall accuracy, and just 20.0% on Proactive Reminder. The researchers identify two core failures: models cannot balance timely responses with coherent content generation, and they struggle to decide both when to respond and what to say.
