The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music, and environmental sounds. Approaches to this problem often adapt self-supervised methods from NLP, such as BERT, to audio. These models rely on the discrete nature of text, so applying them to audio requires either changing the learning objective or mapping the audio signal to a set of discrete classes. In this work, we explore the use of EnCodec, a neural audio codec, to generate discrete targets for learning a universal audio model based on a masked autoencoder (MAE). We evaluate this approach, which we call EncodecMAE, on a wide range of audio tasks spanning speech, music, and environmental sounds, achieving performance comparable to or better than that of leading audio representation models.
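To make the idea of discrete targets concrete, the following is a minimal sketch of a masked-prediction objective over codec tokens. It is an illustration under stated assumptions, not the EncodecMAE implementation: the targets are random integers standing in for codec token indices, the predictions are hand-built logits rather than the output of a trained encoder and decoder, and the names `mask_frames`, `masked_code_loss`, `codebook_size`, and `mask_prob` are hypothetical.

```python
import math
import random


def mask_frames(num_frames, mask_prob, rng):
    """Choose which frame indices are hidden from the model."""
    masked = [t for t in range(num_frames) if rng.random() < mask_prob]
    # Guarantee at least one masked frame so the loss is defined.
    if not masked:
        masked = [rng.randrange(num_frames)]
    return masked


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_code_loss(logits, targets, masked):
    """Mean cross-entropy against discrete targets, masked frames only."""
    total = 0.0
    for t in masked:
        total -= log_softmax(logits[t])[targets[t]]
    return total / len(masked)


rng = random.Random(0)
num_frames, codebook_size, mask_prob = 20, 8, 0.5  # illustrative values

# ASSUMPTION: random integers stand in for per-frame codec token indices.
targets = [rng.randrange(codebook_size) for _ in range(num_frames)]
masked = mask_frames(num_frames, mask_prob, rng)

# A model with no information predicts a uniform distribution per frame.
uniform_logits = [[0.0] * codebook_size for _ in range(num_frames)]
uniform_loss = masked_code_loss(uniform_logits, targets, masked)

# A model that recovers every masked token puts a large logit on the target.
perfect_logits = [
    [20.0 if k == targets[t] else 0.0 for k in range(codebook_size)]
    for t in range(num_frames)
]
perfect_loss = masked_code_loss(perfect_logits, targets, masked)

print(f"uniform: {uniform_loss:.4f}  perfect: {perfect_loss:.2e}")
```

With uniform predictions the loss equals the logarithm of the codebook size, and it approaches zero as the predicted distribution concentrates on the correct token. Restricting the loss to masked frames is what forces the model to infer hidden content from the surrounding context.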