Large-scale self-supervised Pre-Trained Models (PTMs) have yielded significant improvements on the speaker verification (SV) task by providing rich feature representations. In this paper, we utilize w2v-BERT 2.0, a model with approximately 600 million parameters trained on 4.5 million hours of unlabeled data across 143 languages, for the SV task. An MFA structure with a Layer Adapter is employed to process the multi-layer feature outputs of the PTM and extract speaker embeddings. Additionally, we incorporate LoRA for efficient fine-tuning. Our model achieves state-of-the-art results with 0.12% and 0.55% EER on the Vox1-O and Vox1-H test sets, respectively. Furthermore, we apply knowledge-distillation-guided structured pruning, reducing the model size by 80% while incurring only a 0.04% EER degradation. Source code and models are released at https://github.com/ZXHY-82/w2v-BERT-2.0_SV.
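To make the multi-layer feature usage concrete: one common way to fuse hidden states from several PTM layers is a learned softmax-weighted sum before the pooling/embedding stage. The sketch below is a simplified illustration of that idea using NumPy with fixed illustrative weights; the function and variable names are hypothetical and this is not the paper's actual MFA/Layer Adapter implementation.

```python
import numpy as np

def aggregate_layers(hidden_states, layer_logits):
    """Fuse per-layer PTM features into one frame-level representation.

    hidden_states: array of shape (num_layers, time, dim), the stacked
                   hidden-state outputs of each transformer layer.
    layer_logits:  array of shape (num_layers,), learnable scores that
                   are softmax-normalized into per-layer weights.
    """
    # numerically stable softmax over the layer dimension
    w = np.exp(layer_logits - layer_logits.max())
    w /= w.sum()
    # weighted sum across layers -> (time, dim) features for pooling
    return np.tensordot(w, hidden_states, axes=1)

# toy example: 3 layers, 5 frames, 4-dim features, uniform weights
feats = aggregate_layers(np.ones((3, 5, 4)), np.zeros(3))
print(feats.shape)  # (5, 4)
```

In an actual fine-tuning setup the layer weights (and any adapter parameters) would be trained jointly with the speaker-embedding head, while LoRA restricts updates inside the transformer to low-rank matrices.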