With excellent generalization ability, self-supervised speech models have shown impressive performance on various downstream speech tasks in the pre-training and fine-tuning paradigm. However, as pre-trained models grow in size, fine-tuning becomes impractical due to heavy computation and storage overhead, as well as the risk of overfitting. Adapters are lightweight modules inserted into pre-trained models to facilitate parameter-efficient adaptation. In this paper, we propose an effective adapter framework for adapting self-supervised speech models to the speaker verification task. Using a parallel adapter design, the proposed framework inserts two types of adapters into the pre-trained model: one adapts the latent features within intermediate Transformer layers, and the other adapts the output embeddings from all Transformer layers. We conduct comprehensive experiments to validate the efficiency and effectiveness of the proposed framework. Experimental results on the VoxCeleb1 dataset demonstrate that the proposed adapters outperform fine-tuning and other parameter-efficient transfer learning methods while updating only 5% of the parameters.
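To make the parallel adapter idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a bottleneck adapter (down-projection, ReLU, up-projection) whose output is added to that of a frozen layer; the abstract does not specify the adapter's internals, so this structure, the zero-initialised up-projection, the toy dimensions, and all names (`BottleneckAdapter`, `parallel_adapter_forward`, `frozen_layer`) are illustrative assumptions. It shows only a single adapter in parallel with one frozen linear map, not the two adapter types of the proposed framework.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class BottleneckAdapter:
    """Hypothetical bottleneck adapter: down-project, ReLU, up-project."""

    def __init__(self, dim, bottleneck, rng):
        self.down = random_matrix(bottleneck, dim, rng)
        # Assumption: zero-initialised up-projection, so the adapter
        # contributes nothing before any training has happened.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.up, relu(matvec(self.down, x)))

    def num_params(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


def frozen_layer(W, x):
    """Stand-in for a frozen pre-trained sublayer (a single linear map here)."""
    return matvec(W, x)


def parallel_adapter_forward(W_frozen, adapter, x):
    """Parallel design: the adapter reads the same input as the frozen
    layer, and its output is added to the frozen layer's output."""
    h = frozen_layer(W_frozen, x)
    a = adapter(x)
    return [hi + ai for hi, ai in zip(h, a)]


rng = random.Random(0)
dim, bottleneck = 8, 2  # toy sizes chosen for illustration only
W_frozen = random_matrix(dim, dim, rng)
adapter = BottleneckAdapter(dim, bottleneck, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

y = parallel_adapter_forward(W_frozen, adapter, x)

# Only the adapter weights would be updated; the frozen weights stay fixed.
trainable_fraction = adapter.num_params() / (adapter.num_params() + dim * dim)
```

With these toy sizes the trainable fraction is much larger than in a real pre-trained model, where the frozen backbone dominates the parameter count; the numbers here are not meant to reproduce the 5% figure reported above.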