This paper shows that the masked-modeling principle driving the success of large foundational vision models can be effectively applied to audio by making predictions in a latent space. We introduce the Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension method for self-supervised learning from audio spectrograms. Following the design of I-JEPA, A-JEPA encodes visible audio spectrogram patches, selected with a curriculum masking strategy, via a context encoder, and predicts the representations of regions sampled at well-designed locations. The target representations of those regions are extracted from the whole spectrogram by the target encoder, \emph{i.e.}, the exponential moving average of the context encoder. Because audio spectrograms are highly correlated locally in time and frequency, we find it beneficial to move from random block masking to time-frequency aware masking in a curriculum manner. To enhance contextual semantic understanding and robustness, we fine-tune the encoder with regularized masking on target datasets, instead of dropping inputs or setting them to zero. Empirically, when built on the Vision Transformer structure, A-JEPA is highly scalable and sets new state-of-the-art performance on multiple audio and speech classification tasks, outperforming other recent models that use externally supervised pre-training.
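The relationship between the context encoder and the target encoder can be illustrated with a minimal sketch of an exponential moving average (EMA) update. This is not the authors' implementation: the function name `ema_update`, the dictionary-of-lists parameter layout, and the default momentum value are assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical parameter container: name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(target: Params, context: Params, momentum: float = 0.996) -> Params:
    """Return new target parameters: momentum * target + (1 - momentum) * context.

    The momentum default is an illustrative assumption, not a value from the paper.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    updated: Params = {}
    for name, t_vals in target.items():
        c_vals = context[name]
        # Blend each target weight toward the matching context weight.
        updated[name] = [
            momentum * t + (1.0 - momentum) * c for t, c in zip(t_vals, c_vals)
        ]
    return updated


# Toy usage: the target encoder slowly tracks the context encoder.
context_params = {"w": [1.0, 2.0]}
target_params = {"w": [0.0, 0.0]}
target_params = ema_update(target_params, context_params, momentum=0.5)
```

In this sketch only the context parameters would be trained by gradient descent; the target parameters are never optimized directly and change only through the blending step, which is what makes the target encoder a smoothed copy of the context encoder.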