AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes

We propose a method named AudioFormer,which learns audio feature representations through the acquisition of discrete acoustic codes and subsequently fine-tunes them for audio classification tasks. Initially,we introduce a novel perspective by considering the audio classification task as a form of natural language understanding (NLU). Leveraging an existing neural audio codec model,we generate discrete acoustic codes and utilize them to train a masked language model (MLM),thereby obtaining audio feature representations. Furthermore,we pioneer the integration of a Multi-Positive sample Contrastive (MPC) learning approach. This method enables the learning of joint representations among multiple discrete acoustic codes within the same audio input. In our experiments,we treat discrete acoustic codes as textual data and train a masked language model using a cloze-like methodology,ultimately deriving high-quality audio representations. Notably,the MPC learning technique effectively captures collaborative representations among distinct positive samples. Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models across multiple datasets,and even outperforms audio-visual multimodal classification models on select datasets. Specifically,our approach achieves remarkable results on datasets including AudioSet (2M,20K),and FSD50K,with performance scores of 53.9,45.1,and 65.6,respectively. We have openly shared both the code and models: https://github.com/LZH-0225/AudioFormer.git.

翻译：我们提出了一种名为AudioFormer的方法，该方法通过获取离散声学编码来学习音频特征表示，并随后对其进行微调以用于音频分类任务。首先，我们引入了一个新颖的视角，将音频分类任务视为一种自然语言理解（NLU）。利用现有的神经音频编解码器模型，我们生成离散声学编码，并使用它们来训练掩码语言模型（MLM），从而获得音频特征表示。此外，我们率先引入了多正样本对比学习（MPC）方法。该方法能够学习同一音频输入内多个离散声学编码之间的联合表示。在我们的实验中，我们将离散声学编码视为文本数据，并采用完形填空式方法训练掩码语言模型，最终推导出高质量的音频表示。值得注意的是，MPC学习技术有效地捕捉了不同正样本之间的协作表示。我们的研究结果表明，与当前流行的单模态音频分类模型相比，AudioFormer在多个数据集上取得了显著更优的性能，甚至在某些数据集上超越了音频-视觉多模态分类模型。具体而言，我们的方法在AudioSet（2M、20K）和FSD50K等数据集上取得了令人瞩目的结果，性能得分分别为53.9、45.1和65.6。我们已公开共享了代码和模型：https://github.com/LZH-0225/AudioFormer.git。