AuralSAM2 adds audio prompts to SAM2 video segmentation via pyramid fusion
AuralSAM2 integrates audio into Meta's Segment Anything Model 2 for video segmentation, using pyramid feature fusion to avoid adapter overhead and audio signal degradation.

Researchers including Yuyuan Liu and Yuanhong Chen have published AuralSAM2, a method that integrates audio prompts into Meta's Segment Anything Model 2 (SAM2) for video segmentation. The work addresses a core limitation in existing audio-visual approaches. Adapter-based methods that inject audio into SAM2's image encoder suffer from "audio prompt dilution," where the audio signal weakens as it propagates through the network, and they add inference overhead that undermines SAM2's interactive speed.
AuralSAM2 introduces AuralFuser, a module that fuses audio and visual features to generate sparse and dense prompts. Built on SAM2's feature pyramid, these prompts propagate auditory cues across visual layers without modifying the base model. The team adds an audio-guided contrastive loss to emphasize auditory relevance in dominant visual features, aligning the two modalities. The reported result is improved accuracy on public benchmarks with minimal impact on interactive segmentation speed, the trade-off that adapter-based methods fail to balance.
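For readers unfamiliar with contrastive objectives, the general idea behind an audio-guided loss can be sketched with a generic InfoNCE-style formulation. This is an illustration only, not the authors' loss: the function names, the temperature value, and the choice of cosine similarity are all assumptions made here, and the actual AuralSAM2 objective may differ. The sketch scores one audio embedding against several visual feature vectors and penalizes the model when the feature marked as sound-relevant is not the closest match.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-12)


def audio_guided_info_nce(audio, visual_feats, positive_idx, temperature=0.1):
    """Generic InfoNCE loss (hypothetical helper, not the paper's code).

    audio:        embedding of the audio clip
    visual_feats: candidate visual feature vectors
    positive_idx: index of the visual feature assumed to match the sound
    """
    logits = [cosine(audio, v) / temperature for v in visual_feats]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Cross-entropy of the positive against all candidates.
    return log_denominator - logits[positive_idx]


# Toy example: the first visual feature points the same way as the audio.
audio = [1.0, 0.0]
visual_feats = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

loss_aligned = audio_guided_info_nce(audio, visual_feats, positive_idx=0)
loss_misaligned = audio_guided_info_nce(audio, visual_feats, positive_idx=1)
print(loss_aligned < loss_misaligned)
```

In this toy setup the loss is lower when the designated positive is the visual feature most similar to the audio, which is the behavior a contrastive term is meant to reward during training.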
Code is available at github.com/yyliu01/AuralSAM2. The preprint was posted on May 18, 2026.