Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce \textbf{AudioMosaic}, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning. We further show that integrating the pretrained AudioMosaic encoder into audio-language models improves performance on audio-language tasks. The code is publicly available in our \href{https://github.com/HanxunH/AudioMosaic}{GitHub repository}.
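To make the pair construction concrete, the following is a minimal NumPy sketch and not the released implementation. It assumes that "structured" masking means dropping whole frequency rows and whole time columns of the patch grid, that only the visible patches are passed to the encoder (which is where the memory saving would come from), and that the contrastive objective is a symmetric InfoNCE loss. The function names, drop ratios, and temperature are hypothetical and chosen for illustration only.

```python
import numpy as np


def structured_tf_keep_mask(n_freq, n_time, freq_drop, time_drop, rng):
    """Return a boolean (n_freq, n_time) grid; True marks patches kept visible.

    Assumption: whole frequency rows and whole time columns of the
    patch grid are dropped together.
    """
    keep = np.ones((n_freq, n_time), dtype=bool)
    drop_rows = rng.choice(n_freq, size=int(n_freq * freq_drop), replace=False)
    drop_cols = rng.choice(n_time, size=int(n_time * time_drop), replace=False)
    keep[drop_rows, :] = False
    keep[:, drop_cols] = False
    return keep


def make_positive_pair(patches, freq_drop=0.25, time_drop=0.5, seed=0):
    """Build two independently masked views of one clip's patch grid.

    patches: array of shape (n_freq, n_time, patch_dim).
    Returns two arrays of visible patches, each (n_visible, patch_dim).
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time, _ = patches.shape
    views = []
    for _ in range(2):
        keep = structured_tf_keep_mask(n_freq, n_time, freq_drop, time_drop, rng)
        views.append(patches[keep])  # only visible patches would reach the encoder
    return views[0], views[1]


def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE loss over paired embeddings of shape (batch, dim)."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # scaled cosine similarities, (batch, batch)

    def cross_entropy_diag(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))  # matching pairs sit on the diagonal

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Because the numbers of dropped rows and columns are fixed by the ratios, every view keeps the same number of visible patches, so views from different clips can be stacked into one batch without padding. In this sketch, lowering the number of visible patches shortens the token sequence per clip, which is one plausible route to fitting larger batches in memory.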