Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers

From arXiv; to appear in HEAR: Holistic Evaluation of Audio Representations, Proceedings of Machine Learning Research (PMLR) 166. Source code: https://github.com/kkoutini/passt_hear21

The success of supervised deep learning methods is largely due to their ability to learn relevant features from raw data. Deep Neural Networks (DNNs) trained on large-scale datasets can capture a diverse set of features and learn representations that generalize to unseen tasks and datasets from the same domain. Hence, these models can be used as powerful feature extractors, in combination with shallower models as classifiers, for smaller tasks and datasets where the amount of training data is insufficient for learning an end-to-end model from scratch. In recent years, Convolutional Neural Networks (CNNs) have largely been the method of choice for audio processing. Recently, however, attention-based transformer models have demonstrated great potential in supervised settings, outperforming CNNs. In this work, we investigate the use of audio transformers trained on large-scale datasets to learn general-purpose representations. We study how different setups of these audio transformers affect the quality of their embeddings. We experiment with the models' time resolution, extracted embedding level, and receptive fields to see how they affect performance on a variety of tasks and datasets, following the HEAR 2021 NeurIPS challenge evaluation setup. Our results show that representations extracted by audio transformers outperform CNN representations. Furthermore, we show that transformers trained on AudioSet can be extremely effective representation extractors for a wide range of downstream tasks.
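The "frozen feature extractor plus shallow classifier" recipe described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `frozen_embed` is a hypothetical stand-in for a pretrained encoder (it computes only RMS energy and zero-crossing rate, not transformer embeddings), `make_tone` generates synthetic data, and the nearest-centroid classifier is just one possible shallow model. It is not the authors' code and not the HEAR evaluation pipeline.

```python
import math
import random


def frozen_embed(waveform):
    """Hypothetical stand-in for a frozen pretrained encoder.

    Maps a waveform (list of floats) to a fixed-size embedding:
    [RMS energy, zero-crossing rate]. A real setup would call a
    pretrained audio model here and never update its weights.
    """
    n = len(waveform)
    rms = math.sqrt(sum(x * x for x in waveform) / n)
    crossings = sum(
        1 for a, b in zip(waveform, waveform[1:]) if (a < 0) != (b < 0)
    )
    return [rms, crossings / (n - 1)]


def make_tone(freq_hz, seed, sample_rate=8000, n_samples=800):
    """Synthetic sine tone with a random phase (toy data, assumption)."""
    phase = random.Random(seed).uniform(0.0, 2.0 * math.pi)
    return [
        math.sin(2.0 * math.pi * freq_hz * t / sample_rate + phase)
        for t in range(n_samples)
    ]


def fit_centroids(embeddings, labels):
    """Shallow classifier: store the mean embedding of each class."""
    sums, counts = {}, {}
    for emb, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(emb))
        for i, value in enumerate(emb):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [v / counts[label] for v in acc] for label, acc in sums.items()
    }


def predict(centroids, emb):
    """Return the label whose centroid is closest in squared distance."""
    return min(
        centroids,
        key=lambda label: sum(
            (a - b) ** 2 for a, b in zip(emb, centroids[label])
        ),
    )


# Small labeled downstream task: low-pitched vs. high-pitched tones.
train_waves = [make_tone(200, seed=s) for s in range(5)]
train_waves += [make_tone(2000, seed=s) for s in range(5, 10)]
train_labels = ["low"] * 5 + ["high"] * 5

# Only the shallow classifier is fitted; the extractor stays fixed.
train_embeddings = [frozen_embed(w) for w in train_waves]
centroids = fit_centroids(train_embeddings, train_labels)

# Classify tones at frequencies not seen during fitting.
for freq in (250, 1800):
    print(freq, predict(centroids, frozen_embed(make_tone(freq, seed=42))))
```

The design point is that the downstream task only trains the small classifier on top of fixed embeddings, which is why the recipe suits datasets too small for end-to-end training.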
