Deep speaker models achieve high accuracy in speaker verification, but often at the cost of increased model size and computation time, which makes deployment in resource-constrained environments difficult. We address this limitation by developing small-footprint deep speaker embedding extraction using knowledge distillation. While previous work in this domain has concentrated on speaker embedding extraction at the utterance level, our approach combines embeddings from different levels of the x-vector model (the teacher network) to train a compact student network. The results highlight the importance of frame-level information: the student models are 85%-91% smaller than their teacher counterparts, depending on the size of the teacher embeddings. Notably, by concatenating teacher embeddings, we obtain student networks that perform comparably to the teacher with a 75% reduction in model size. These findings extend to other x-vector variants, indicating the broad applicability of our approach.