Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example, typical audio contrastive learning uses temporal relationships among input sounds to create training signals, whereas other methods use differences among input views created by data augmentations. However, these training signals do not provide information derived from the intact input sound, which we think is suboptimal for learning a representation that describes the input as it is. In this paper, we seek to learn audio representations using the input itself as supervision, through a pretext task of auto-encoding masked spectrogram patches: Masked Spectrogram Modeling (MSM, a variant of Masked Image Modeling applied to audio spectrograms). To implement MSM, we use Masked Autoencoders (MAE), an image self-supervised learning method. MAE learns to efficiently encode a small number of visible patches into latent representations that carry the essential information for reconstructing a large number of masked patches. During training, MAE minimizes the reconstruction error, which uses the input as the training signal, thereby achieving our goal. We conducted experiments on our MSM using MAE (MSM-MAE) models under the evaluation benchmark of the HEAR 2021 NeurIPS Challenge. Our MSM-MAE models outperformed the HEAR 2021 Challenge results on seven out of 15 tasks (e.g., accuracies of 73.4% on CREMA-D and 85.8% on LibriCount), while showing top performance on other tasks where specialized models perform better. We also investigate how the design choices of MSM-MAE impact performance and conduct a qualitative analysis of visualization outcomes to gain an understanding of the learned representations. We make our code available online.
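
The masking and reconstruction objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`patchify`, `random_mask`, `masked_mse`), the 8x16 toy spectrogram, the 4x4 patch size, and the 0.75 mask ratio are all assumptions chosen for illustration, and the encoder-decoder is replaced by a placeholder that predicts zeros. Computing the loss on masked patches only follows the original image MAE; the abstract itself does not state this detail.

```python
import random


def patchify(spec, patch_f, patch_t):
    """Split a 2-D spectrogram (list of frequency rows) into flattened patches."""
    n_f, n_t = len(spec), len(spec[0])
    assert n_f % patch_f == 0 and n_t % patch_t == 0, "shape must divide evenly"
    patches = []
    for f0 in range(0, n_f, patch_f):
        for t0 in range(0, n_t, patch_t):
            patches.append([spec[f][t]
                            for f in range(f0, f0 + patch_f)
                            for t in range(t0, t0 + patch_t)])
    return patches


def random_mask(num_patches, mask_ratio, rng):
    """Randomly partition patch indices into (visible, masked) index lists."""
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def masked_mse(pred_patches, target_patches, masked_idx):
    """Mean squared reconstruction error over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred_patches[i], target_patches[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy log-mel spectrogram: 8 frequency bins x 16 time frames (hypothetical values).
rng = random.Random(0)
spec = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]

patches = patchify(spec, patch_f=4, patch_t=4)          # 2 x 4 = 8 patches
visible_idx, masked_idx = random_mask(len(patches), 0.75, rng)

# In an MAE, the encoder would see only patches[i] for i in visible_idx, and a
# decoder would predict every patch. Here a zero prediction stands in for both.
prediction = [[0.0] * len(p) for p in patches]
loss = masked_mse(prediction, patches, masked_idx)
print(len(visible_idx), len(masked_idx), loss)
```

The training target is the input spectrogram itself, which is the point the abstract makes: no augmented view or temporal neighbor is needed to form the supervision signal.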
