The goal of universal audio representation learning is to obtain foundation models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To approach this problem, methods inspired by work on self-supervised learning for NLP, like BERT, or computer vision, like masked autoencoders (MAE), are often adapted to the audio domain. In this work, we propose masking representations of the audio signal and training an MAE to reconstruct the masked segments. Reconstruction is performed by predicting, from the unmasked inputs, the discrete units generated by EnCodec, a neural audio codec. We evaluate this approach, which we call EnCodecMAE, on a wide range of tasks involving speech, music and environmental sounds. Our best model outperforms various state-of-the-art audio representation models in terms of global performance. Additionally, we evaluate the resulting representations on the challenging task of automatic speech recognition (ASR), obtaining decent results and paving the way for a universal audio representation.
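As a rough illustration of the objective described above, the following sketch masks contiguous spans of frames and scores predictions against discrete codec tokens at the masked positions only. It is a toy under stated assumptions, not the EnCodecMAE implementation: the function names, the mask ratio, the span length, the vocabulary size, and the choice to compute the loss only on masked frames are all hypothetical, and random integers stand in for the units an actual neural codec such as EnCodec would produce.

```python
import math
import random


def sample_span_mask(num_frames, mask_ratio=0.5, span=5, seed=0):
    """Pick contiguous spans of frames to mask; returns sorted frame indices.

    mask_ratio and span are illustrative values, not taken from the source.
    """
    rng = random.Random(seed)
    target = int(num_frames * mask_ratio)
    masked = set()
    while len(masked) < target:
        start = rng.randrange(num_frames)
        for i in range(start, min(start + span, num_frames)):
            if len(masked) < target:
                masked.add(i)
    return sorted(masked)


def masked_token_loss(logits, codes, masked):
    """Mean cross-entropy between per-frame logits and codec tokens,
    evaluated only at the masked frames (an assumption of this sketch)."""
    total = 0.0
    for t in masked:
        row = logits[t]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[codes[t]]
    return total / max(len(masked), 1)


# Toy setup: 20 frames, a hypothetical vocabulary of 8 discrete units.
NUM_FRAMES, VOCAB = 20, 8
rng = random.Random(1)
codes = [rng.randrange(VOCAB) for _ in range(NUM_FRAMES)]  # stand-in for codec tokens

masked = sample_span_mask(NUM_FRAMES)
visible = [t for t in range(NUM_FRAMES) if t not in set(masked)]

# An untrained predictor that outputs uniform logits for every frame.
uniform_logits = [[0.0] * VOCAB for _ in range(NUM_FRAMES)]
loss = masked_token_loss(uniform_logits, codes, masked)
```

With uniform logits the loss equals the log of the vocabulary size, which gives a simple chance-level baseline for such a predictor. In a real system an encoder would see only the `visible` frames and a decoder would produce the logits for the `masked` ones.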