The advent of hyper-scale, general-purpose pre-trained models is shifting the paradigm of building task-specific models for target tasks. In audio research, task-agnostic pre-trained models with high transferability and adaptability have achieved state-of-the-art performance when fine-tuned for downstream tasks. Nevertheless, re-training all the parameters of these massive models entails an enormous amount of time and cost, along with a huge carbon footprint. To overcome these limitations, the present study explores and applies efficient transfer learning methods in the audio domain. We also propose an integrated parameter-efficient tuning (IPET) framework that combines the embedding prompt (a prompt-based learning approach) with the adapter (an effective transfer learning method). We demonstrate the efficacy of the proposed framework using two backbone pre-trained audio models with different characteristics: the audio spectrogram transformer and wav2vec 2.0. The proposed IPET framework performs remarkably well compared to the fine-tuning method while using fewer trainable parameters in four downstream tasks: sound event classification, music genre classification, keyword spotting, and speaker verification. Furthermore, we identify and analyze the shortcomings of the IPET framework, providing lessons and research directions for parameter-efficient tuning in the audio domain.
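To make the two ingredients named above concrete, the following is a minimal, dependency-free sketch of an embedding prompt and a bottleneck adapter. It is an illustration under stated assumptions, not the paper's implementation: the dimensions, the zero initialization of the up-projection, the omission of bias terms, and all function names (`prepend_prompts`, `adapter`, `count_params`) are hypothetical, and a real setup would place adapters inside each block of a frozen backbone rather than applying one to the input sequence.

```python
import random

# All sizes below are hypothetical and chosen only to keep the sketch small.
D_MODEL = 8      # embedding width of the (frozen) backbone
BOTTLENECK = 2   # adapter bottleneck width
N_PROMPTS = 3    # number of trainable prompt vectors
SEQ_LEN = 5      # number of input token embeddings


def make_matrix(rows, cols, scale, rng):
    """Return a rows x cols matrix of Gaussian values as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def adapter(x, w_down, w_up):
    """Generic bottleneck adapter: x + W_up * relu(W_down * x)."""
    hidden = relu(matvec(w_down, x))
    return [a + b for a, b in zip(x, matvec(w_up, hidden))]


def prepend_prompts(tokens, prompts):
    """Embedding prompt: trainable vectors placed in front of the input embeddings."""
    return prompts + tokens


def count_params(matrix):
    return sum(len(row) for row in matrix)


rng = random.Random(0)

# Trainable pieces (the backbone itself would stay frozen).
prompts = make_matrix(N_PROMPTS, D_MODEL, 0.02, rng)
w_down = make_matrix(BOTTLENECK, D_MODEL, 0.02, rng)
# Assumption: zero-initialized up-projection, so the adapter starts as an identity map.
w_up = [[0.0] * BOTTLENECK for _ in range(D_MODEL)]

# Stand-in for the patch or frame embeddings produced by a frozen backbone.
tokens = make_matrix(SEQ_LEN, D_MODEL, 1.0, rng)

sequence = prepend_prompts(tokens, prompts)
adapted = [adapter(x, w_down, w_up) for x in sequence]

trainable = count_params(prompts) + count_params(w_down) + count_params(w_up)
print("sequence length:", len(sequence))
print("trainable parameters:", trainable)
```

In this toy configuration only the prompt vectors and the two adapter projections would be updated during training, which is the sense in which such methods use fewer trainable parameters than full fine-tuning.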