Self-supervised speech models are a rapidly developing research topic in fake audio detection. Many pre-trained models can serve as feature extractors that learn richer, higher-level speech features. However, fine-tuning these pre-trained models often requires excessively long training times and high memory consumption, which makes full fine-tuning very expensive. To alleviate this problem, we apply low-rank adaptation (LoRA) to the wav2vec2 model: we freeze the pre-trained model weights and inject trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared with full fine-tuning of the wav2vec2 model with Adam, which updates 317M trainable parameters, LoRA achieves similar performance while reducing the number of trainable parameters by a factor of 198.
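The mechanism described in the abstract, a frozen weight plus a trainable low-rank update, can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the class name `LoRALinear`, the `alpha` scaling parameter, and the toy sizes (64-dimensional layer, rank 4) are assumptions chosen for readability, and the abstract does not say which projections inside each Transformer layer are adapted or what rank is used. The zero initialization of `B` and the `alpha / rank` scaling follow the general LoRA formulation.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Stand-in for a pre-trained weight; in LoRA it is never updated.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors: A is small and random, B starts at zero, so the
        # adapted layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        """Apply the layer to x, a list of d_in floats."""
        col = [[v] for v in x]
        base = matmul(self.W, col)                    # frozen path: W x
        update = matmul(self.B, matmul(self.A, col))  # low-rank path: B (A x)
        return [b[0] + self.scale * u[0] for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


# Toy sizes (assumed for illustration, not taken from the paper).
layer = LoRALinear(d_in=64, d_out=64, rank=4)
reduction = layer.frozen_params() / layer.trainable_params()

# Arithmetic implied by the abstract: 317M parameters reduced by a factor
# of 198 leaves roughly 1.6M trainable parameters.
approx_lora_params = round(317e6 / 198)
```

For a single `d_out x d_in` matrix, the trainable count drops from `d_out * d_in` to `rank * (d_in + d_out)`, so the saving grows as the rank shrinks relative to the layer width. The overall reduction reported for the full model also depends on how many matrices per layer receive an adapter.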