Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the heavy computational cost of pre-training poses a significant barrier to the application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality, we introduce the Efficient Audio Transformer (EAT) to further improve both effectiveness and efficiency in audio SSL. EAT adapts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling of acoustic events. Furthermore, we show that the masking strategy is critical in audio SSL pre-training, and that superior audio representations can be obtained with large inverse block masks. Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a pre-training speedup of up to ~15x compared to existing audio SSL models.
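To make the notion of an inverse block mask concrete, the following is a minimal sketch rather than the EAT implementation. It assumes that "inverse" means the blocks are sampled as the region to keep visible, with every other patch masked. The function name `inverse_block_mask`, the 8 x 64 patch grid, the block size of 5, and the 80% mask ratio are all illustrative assumptions, not values taken from the text above.

```python
import random


def inverse_block_mask(height, width, block=5, mask_ratio=0.8, seed=None):
    """Return a height x width grid of booleans; True marks a masked patch.

    Hypothetical helper: square blocks are sampled as the *visible* region
    and everything outside them is masked. The final block is trimmed so
    the number of visible patches matches the requested ratio exactly.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    total = height * width
    keep_target = max(1, round(total * (1.0 - mask_ratio)))

    # Start fully masked, then carve out visible blocks.
    masked = [[True] * width for _ in range(height)]
    kept = 0
    while kept < keep_target:
        # Pick the top-left corner of a block that fits inside the grid.
        top = rng.randrange(max(1, height - block + 1))
        left = rng.randrange(max(1, width - block + 1))
        for r in range(top, min(top + block, height)):
            for c in range(left, min(left + block, width)):
                if masked[r][c] and kept < keep_target:
                    masked[r][c] = False
                    kept += 1
    return masked


if __name__ == "__main__":
    mask = inverse_block_mask(8, 64, block=5, mask_ratio=0.8, seed=0)
    # Print the grid: '#' is a masked patch, '.' is a visible patch.
    for row in mask:
        print("".join("#" if m else "." for m in row))
```

In this sketch only the visible patches would be passed to the encoder, so a higher mask ratio directly reduces the number of tokens processed per clip. How the visible blocks are sampled and adjusted in practice is a design choice that this example does not attempt to reproduce.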