AudioMosaic sets a new state of the art in audio SSL with structured masking and lower memory cost
AudioMosaic is a contrastive self-supervised audio encoder that uses structured time-frequency masking on spectrograms to reduce memory overhead and achieve state-of-the-art results on standard audio benchmarks.
AudioMosaic, a contrastive self-supervised learning encoder introduced in a new arXiv preprint, applies structured time-frequency masking to spectrogram patches during pre-training. The approach constructs positive pairs by masking portions of the time-frequency grid, which reduces memory consumption and enables efficient large-batch training, a persistent challenge for contrastive methods in audio.
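The paper's exact masking scheme is its contribution and isn't reproduced here, but a minimal PyTorch sketch of the general idea, masking contiguous time and frequency bands (SpecAugment-style) so that two independently masked copies of one utterance form a positive pair, might look like the following. All function names, mask counts, and widths are illustrative assumptions, not the paper's settings:

```python
# Illustrative sketch of structured time-frequency masking on a spectrogram.
# This is NOT the paper's exact pattern; it zeroes random contiguous bands,
# SpecAugment-style, to build two views of the same utterance.
import torch


def structured_tf_mask(spec: torch.Tensor,
                       num_time_masks: int = 2,
                       num_freq_masks: int = 2,
                       max_time_width: int = 20,
                       max_freq_width: int = 12) -> torch.Tensor:
    """Zero out random contiguous time and frequency bands.

    spec: (freq_bins, time_steps) log-mel spectrogram.
    """
    masked = spec.clone()
    n_freq, n_time = masked.shape
    for _ in range(num_time_masks):
        w = torch.randint(1, max_time_width + 1, (1,)).item()
        t0 = torch.randint(0, max(1, n_time - w), (1,)).item()
        masked[:, t0:t0 + w] = 0.0          # mask a time band
    for _ in range(num_freq_masks):
        w = torch.randint(1, max_freq_width + 1, (1,)).item()
        f0 = torch.randint(0, max(1, n_freq - w), (1,)).item()
        masked[f0:f0 + w, :] = 0.0          # mask a frequency band
    return masked


# Two independently masked views of the same utterance form a positive pair.
spec = torch.randn(80, 400)                 # 80 mel bins, 400 frames
view_a = structured_tf_mask(spec)
view_b = structured_tf_mask(spec)
```

Zeroing bands rather than cropping keeps both views the same shape, which simplifies batching; whether AudioMosaic masks by zeroing, dropping patches, or some other structured pattern is not something this sketch settles.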
In recent years, audio self-supervised learning has been dominated by generative reconstruction objectives, while contrastive approaches have remained less explored, partly due to the difficulty of designing effective audio augmentations and the computational cost of large batch sizes. AudioMosaic addresses both issues by working directly on spectrogram representations and applying structured masking patterns that create meaningful positive pairs without the memory overhead of traditional contrastive frameworks. The encoder learns discriminative utterance-level representations that transfer well across datasets, domains, and acoustic conditions, a key advantage for practitioners working with audio from varied sources such as speech, music, and environmental sound.
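The preprint's exact objective isn't quoted here. As a reference point, a standard InfoNCE / NT-Xent loss over the paired utterance embeddings, the family of objectives most contrastive audio encoders build on, can be sketched as follows; `z_a` and `z_b` stand for the encoder outputs of the two masked views:

```python
# A minimal symmetric InfoNCE / NT-Xent loss over paired embeddings.
# This is a common baseline objective, not necessarily AudioMosaic's.
import torch
import torch.nn.functional as F


def info_nce(z_a: torch.Tensor, z_b: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """z_a, z_b: (batch, dim) embeddings of the two views of each utterance."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature     # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Each view should be most similar to its own positive on the diagonal;
    # all other utterances in the batch act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


z_a, z_b = torch.randn(32, 256), torch.randn(32, 256)
loss = info_nce(z_a, z_b)
```

The batch-by-batch similarity matrix is where large-batch contrastive training gets expensive, which is the cost the structured-masking design reportedly mitigates.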
When integrated into audio-language models, the pretrained AudioMosaic encoder improves performance on multimodal audio-language tasks, suggesting the learned representations capture semantic structure useful beyond purely acoustic classification. The authors demonstrate state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning protocols. Code and weights are available on GitHub; the full paper is arXiv:2605.14231v1.
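For context on the evaluation protocols mentioned above, linear probing freezes the pretrained encoder and fits only a linear classifier on top of its features, while fine-tuning updates the whole network. A schematic version of the probing setup, with a stand-in encoder since the released checkpoint's loading API isn't shown here and the class count is hypothetical, could look like:

```python
# Sketch of the linear-probing protocol: frozen encoder, trainable linear head.
# The nn.Sequential below is a stand-in for the real pretrained encoder.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(80 * 400, 256))   # placeholder encoder
for p in encoder.parameters():
    p.requires_grad = False                          # encoder stays frozen

probe = nn.Linear(256, 50)                           # hypothetical 50-way task
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

spec = torch.randn(8, 80 * 400)                      # batch of flattened inputs
labels = torch.randint(0, 50, (8,))

opt.zero_grad()
with torch.no_grad():                                # no gradients to the encoder
    feats = encoder(spec)
loss = nn.functional.cross_entropy(probe(feats), labels)
loss.backward()                                      # updates only the probe
opt.step()
```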
