Speech self-supervised learning (SSL) models have achieved remarkable performance in speaker verification (SV). However, their high computational cost makes development and deployment difficult. Several studies have compressed SSL models through knowledge distillation (KD) without considering the target task, so the compressed models could not extract SV-tailored features. This paper proposes One-Step Knowledge Distillation and Fine-Tuning (OS-KDFT), which performs KD and fine-tuning (FT) jointly. We optimize the student model for SV during KD training to avoid distilling information that is inappropriate for SV. OS-KDFT reduces the size of a Wav2Vec 2.0 (W2V2) based ECAPA-TDNN by approximately 76.2% and the SSL model's inference time by 79%, while achieving an equal error rate (EER) of 0.98%. The proposed OS-KDFT is validated on the VoxCeleb1 and VoxCeleb2 datasets and with the W2V2 and HuBERT SSL models. The code for our experiments is available on our GitHub.
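To illustrate what combining KD and FT in one step could look like, the sketch below adds a distillation term to a speaker-classification term and optimizes them together. The abstract does not state the loss functions or how they are weighted, so everything here is an assumption made for illustration: mean squared error for distillation, softmax cross-entropy for speaker classification, and the names `mse`, `cross_entropy`, `joint_loss`, and `alpha` are all hypothetical. It is not the authors' implementation.

```python
import math


def mse(student_feats, teacher_feats):
    """Mean squared error between two equal-length feature vectors."""
    assert len(student_feats) == len(teacher_feats)
    total = sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats))
    return total / len(student_feats)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def joint_loss(student_feats, teacher_feats, speaker_logits, speaker_label, alpha=1.0):
    """Assumed one-step objective: distillation term plus weighted SV term.

    alpha is a hypothetical weight; the source does not specify one.
    """
    kd_term = mse(student_feats, teacher_feats)            # match the teacher
    sv_term = cross_entropy(speaker_logits, speaker_label)  # classify the speaker
    return kd_term + alpha * sv_term


# Toy values: a 3-dimensional feature and 4 candidate speakers.
loss = joint_loss(
    student_feats=[0.2, -0.1, 0.5],
    teacher_feats=[0.3, 0.0, 0.4],
    speaker_logits=[2.0, 0.5, -1.0, 0.1],
    speaker_label=0,
    alpha=1.0,
)
print(f"joint loss: {loss:.4f}")
```

Because both terms feed a single objective, the student is pushed toward the teacher's representation and toward speaker discrimination at the same time, rather than in two separate training stages.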