SEATS cuts Qwen omni-modal prefill FLOPs by 9.3× with layer-wise token pruning
A training-free method drops 90% of audio-visual tokens across LLM layers, delivering 4.8× prefill speedup on Qwen models while keeping 96.3% of original accuracy.

SEATS, a training-free token selection technique from researchers at Alibaba and Tsinghua University, accelerates omni-modal large language models by pruning audio and visual tokens layer by layer. Applied to Qwen2.5-Omni and Qwen3-Omni, the method retains just 10% of non-textual tokens yet preserves 96.3% of baseline performance, cutting prefill-stage FLOPs by 9.3× and wall-clock time by 4.8×. The preprint was released this week on HuggingFace Papers.
Omni-modal LLMs encode video and audio into temporally aligned token sequences that run through every transformer layer alongside text. That density drives up compute costs. Existing pruning approaches either ignore audio entirely or apply a fixed per-modality ratio before the LLM starts, so they miss how cross-modal dependencies shift as information flows deeper. The authors traced token attention weights across Qwen2.5-Omni's 72 layers and found that visual and audio dependencies follow a block-wise pattern: early layers fuse modalities, middle layers stabilize the representation, and late layers rely almost entirely on text once fusion is complete. Many audio-visual tokens become redundant after the first third of the network.
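A per-layer attention trace of this kind can be approximated as follows. This is a minimal sketch, not the authors' analysis code: the function name `modality_attention_share`, the input layout (one head-averaged attention row per layer) and the toy weights are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def modality_attention_share(
    attn_per_layer: Sequence[Sequence[float]],
    modality_of_token: Sequence[str],
) -> List[Dict[str, float]]:
    """Return, for each layer, the fraction of attention mass per modality.

    attn_per_layer[l][j] is assumed to be the attention weight that the last
    query token places on key token j at layer l, averaged over heads.
    """
    shares: List[Dict[str, float]] = []
    for weights in attn_per_layer:
        total = sum(weights)
        layer_share: Dict[str, float] = {}
        for weight, modality in zip(weights, modality_of_token):
            # Accumulate the normalized weight under the token's modality.
            layer_share[modality] = layer_share.get(modality, 0.0) + weight / total
        shares.append(layer_share)
    return shares


# Toy example (made-up numbers): an early layer that attends mostly to
# video and audio, and a late layer that attends mostly to text.
modalities = ["video", "video", "audio", "text"]
attention = [
    [0.4, 0.3, 0.2, 0.1],    # hypothetical early layer
    [0.05, 0.05, 0.1, 0.8],  # hypothetical late layer
]
for layer_index, share in enumerate(modality_attention_share(attention, modalities)):
    print(layer_index, {m: round(s, 2) for m, s in share.items()})
```

Plotting these shares against layer index is one way to look for the block-wise pattern the authors describe.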
Layer-wise pruning strategy
SEATS operates in three stages. Before the LLM ingests tokens, it removes spatiotemporal redundancy using attention-weighted diversity selection, merging frames and audio windows that carry similar information. Inside the LLM, the method prunes progressively across transformer blocks, reallocating the retention budget from temporal windows to individual modalities based on query relevance scores: a frame that matters for the current question keeps its tokens, while irrelevant windows lose theirs. In the final third of layers, SEATS drops all remaining non-textual tokens, leaving only the fused text representation to finish inference.
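The in-LLM stages can be illustrated with a toy schedule. The sketch below is not the SEATS algorithm: it assumes a single fixed relevance score per non-text token, a hypothetical linear decay of the retention budget, and a hard cutoff at the start of the final third of layers. The names `keep_top_k`, `tokens_to_keep`, `surviving_counts` and the `start_ratio` parameter are invented for this example, and the budget reallocation between temporal windows and modalities is left out.

```python
import math
from typing import List, Sequence


def keep_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, in original token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def tokens_to_keep(layer: int, num_layers: int, num_tokens: int,
                   start_ratio: float = 0.5) -> int:
    """Hypothetical budget: linear decay, then zero in the final third."""
    cutoff = (2 * num_layers) // 3  # first layer of the final third
    if layer >= cutoff:
        return 0  # drop every remaining non-text token
    ratio = start_ratio * (1 - layer / cutoff)
    return math.ceil(num_tokens * ratio)


def surviving_counts(scores: Sequence[float], num_layers: int) -> List[int]:
    """Count the non-text tokens still alive after each layer's pruning step."""
    alive = list(range(len(scores)))
    counts: List[int] = []
    for layer in range(num_layers):
        k = tokens_to_keep(layer, num_layers, len(scores))
        kept = keep_top_k([scores[i] for i in alive], k)
        alive = [alive[j] for j in kept]  # map back to original token indices
        counts.append(len(alive))
    return counts


# Ten non-text tokens with made-up query-relevance scores, six layers.
relevance = [0.9, 0.1, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6, 0.05, 0.5]
print(surviving_counts(relevance, num_layers=6))
```

A real implementation would recompute relevance from the model's attention at each block rather than reuse one static score, so the surviving set could differ from this simple nested top-k.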
On eight video-understanding and audio-visual benchmarks, Qwen2.5-Omni at 10% token retention keeps 96.3% of its uncompressed average accuracy while FLOPs fall to 10.7% of the original count. Prefill latency on a single A100 GPU shrinks to 0.21× that of the uncompressed run. Fixed-ratio baselines, which keep the same 10% budget but prune uniformly across all layers, recover only 91.8% of baseline accuracy, indicating that stage-aware scheduling outperforms uniform pruning at the same budget.
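The speedup factors and the residual fractions quoted above are two views of the same numbers, and the conversion is a reciprocal. The helper below is only an arithmetic check on the reported figures (the function name is made up for this example). The two forms agree up to rounding.

```python
def fraction_from_speedup(speedup: float) -> float:
    """Convert an N-times reduction into the fraction of the original cost."""
    return 1.0 / speedup


# Figures as reported in the article.
flops_fraction = fraction_from_speedup(9.3)    # reported as about 10.7% of FLOPs
latency_fraction = fraction_from_speedup(4.8)  # reported as 0.21x prefill latency

print(f"FLOPs fraction:   {flops_fraction:.3f}")
print(f"Latency fraction: {latency_fraction:.3f}")
```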