OmniBoost pushes a 3B omni-modal model to 30B-level performance with staged post-training
Researchers stripped visual shortcuts from nine omni-modal benchmarks, then applied staged post-training (bi-modal SFT, RLVR, and self-distillation) to bring Qwen2.5-Omni-3B to performance on par with a 30B model.

A team led by Che Liu has released OmniBoost, a three-stage post-training recipe that brings a 3-billion-parameter omni-modal language model to performance levels matching a 30-billion-parameter baseline—without relying on a stronger teacher. The work, detailed in a preprint released this week, also introduces OmniClean, a cleaned evaluation suite that filters out queries solvable by vision alone.
Omni-modal models are designed to integrate audio, visual, and language inputs, but the authors found that many benchmark questions can be answered using only visual evidence. They audited 16,968 queries across nine omni-modal benchmarks, removed visually solvable items, and retained 8,551 queries where audio-visual-language integration is genuinely required. The resulting OmniClean suite exposes how heavily prior benchmark gains were inflated by visual shortcuts.
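
The preprint's audit procedure is not published as code; the sketch below shows one plausible shape for such a filter, assuming the audit uses a vision-only probe model that sees the frames but not the audio. Query, filter_visual_shortcuts, and the vision_probe callable are illustrative names, not the authors' interface.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Query:
        question: str
        frames: list        # visual evidence, shown to the probe
        audio: bytes        # audio evidence, withheld from the probe
        answer: str         # gold answer

    def filter_visual_shortcuts(
        queries: list[Query],
        vision_probe: Callable[[str, list], str],  # stand-in for a VLM inference call
    ) -> list[Query]:
        """Keep only queries the vision-only probe answers incorrectly,
        i.e. queries that genuinely require the audio stream."""
        kept = []
        for q in queries:
            pred = vision_probe(q.question, q.frames)  # audio deliberately omitted
            if pred.strip().lower() != q.answer.strip().lower():
                kept.append(q)
        return kept

Applied to the original 16,968 queries, a filter of this shape would yield the kind of reduced, audio-dependent subset the paper reports (8,551 items in OmniClean).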
On post-training stages
OmniBoost starts from Qwen2.5-Omni-3B and applies three phases. Mixed bi-modal supervised fine-tuning (SFT) yields limited and uneven gains. Reinforcement learning with verifiable rewards (RLVR) delivers the first broad improvement across benchmarks. Finally, SFT on self-distilled data (queries generated by the model itself) reshapes the benchmark profile and pushes aggregate performance slightly above Qwen3-Omni-30B-A3B-Instruct.
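
The self-distillation interface is not spelled out here; the sketch below illustrates one plausible form of the stage-three data generation, assuming the model both poses and answers its own omni-modal queries through a hypothetical model.generate(prompt, media) call.

    def self_distill(model, clips, n_per_clip=3):
        """Hypothetical sketch: the model writes its own omni-modal queries
        over unlabeled audio-video clips, then answers them; the resulting
        (query, answer) pairs become the SFT targets for stage three."""
        examples = []
        for clip in clips:
            for _ in range(n_per_clip):
                # Ask for a question that needs the audio track, not just the frames.
                query = model.generate(
                    prompt=("Write a question about this clip that can only be "
                            "answered by listening to the audio as well as watching."),
                    media=clip,
                )
                answer = model.generate(prompt=query, media=clip)
                examples.append({"media": clip, "query": query, "answer": answer})
        return examples

How the generated pairs are filtered for quality is not described here, so the sketch leaves that step out.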
The authors argue that visually debiased evaluation makes omni-modal progress easier to interpret and that small models can benefit from staged post-training with self-distilled omni-query supervision. Full details and the project page are available at cheliu-computation.github.io/omni.