Recently, Transformers have been introduced into the field of acoustic recognition. Pre-trained on large-scale datasets with methods such as supervised and semi-supervised learning, they demonstrate strong generality: they are easily fine-tuned to downstream tasks and perform robustly. However, the predominant fine-tuning method is still full fine-tuning, which updates all parameters during training. This not only incurs significant memory and time costs but also compromises the model's generality. Other fine-tuning methods either struggle to address this issue or fail to match the performance of full fine-tuning. We therefore conduct a comprehensive analysis of existing fine-tuning methods and propose an efficient fine-tuning approach based on Adapter tuning, namely AAT. The core idea is to freeze the audio Transformer model and insert extra learnable Adapters, which efficiently acquire downstream task knowledge without compromising the model's original generality. Extensive experiments show that our method achieves performance comparable or even superior to full fine-tuning while optimizing only 7.118% of the parameters. It also outperforms other fine-tuning methods.
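To make the parameter-efficiency claim concrete, the sketch below counts how many parameters stay frozen and how many are trained when a bottleneck Adapter is inserted into each layer of a frozen Transformer encoder. It is only an illustration under stated assumptions: the function names, the layer layout (standard attention plus feed-forward block), and the dimensions (768-wide model, 3072-wide feed-forward, bottleneck of 64, one Adapter per layer) are hypothetical and are not taken from AAT. Embeddings and the task head are omitted, so the result is not expected to reproduce the 7.118% figure.

```python
def encoder_layer_params(d_model: int, d_ffn: int) -> int:
    """Weights and biases of one standard Transformer encoder layer (assumed layout)."""
    attn = 4 * (d_model * d_model + d_model)  # Q, K, V and output projections
    ffn = (d_model * d_ffn + d_ffn) + (d_ffn * d_model + d_model)
    norms = 2 * 2 * d_model  # two LayerNorms, each with scale and shift
    return attn + ffn + norms


def adapter_params(d_model: int, bottleneck: int) -> int:
    """Bottleneck adapter: down-projection followed by up-projection, with biases."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


def trainable_fraction(d_model: int, d_ffn: int, num_layers: int,
                       bottleneck: int, adapters_per_layer: int = 1) -> float:
    """Share of parameters updated when the backbone is frozen and only adapters train."""
    frozen = num_layers * encoder_layer_params(d_model, d_ffn)
    trainable = num_layers * adapters_per_layer * adapter_params(d_model, bottleneck)
    return trainable / (frozen + trainable)


if __name__ == "__main__":
    # Hypothetical dimensions chosen for illustration only.
    frac = trainable_fraction(d_model=768, d_ffn=3072, num_layers=12, bottleneck=64)
    print(f"trainable share: {frac:.3%}")
```

In this simplified count the trainable share depends on the bottleneck width and the number of Adapters per layer rather than on the encoder depth, which is why Adapter tuning can update only a small, adjustable slice of the model while the pre-trained weights stay untouched.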