Deep speaker models yield low error rates in speaker verification. However, this performance tends to come at the cost of model size and computation time, making these models challenging to run under resource-constrained conditions. We focus on small-footprint deep speaker embedding extraction, leveraging knowledge distillation. While prior work on this topic has addressed speaker embedding extraction at the utterance level, we propose to combine embeddings from various levels of the x-vector model (teacher network) to train small-footprint student networks. Results indicate the usefulness of frame-level information, with the student models being 85%-91% smaller than their teacher, depending on the size of the teacher embeddings. Concatenating teacher embeddings yields student networks that reach performance comparable to the teacher with a 75% relative size reduction. The findings are extended to other x-vector variants.
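As a rough illustration of combining teacher embeddings from different levels, the following minimal sketch builds a distillation target by concatenating a pooled frame-level vector with an utterance-level embedding, then scores a student embedding against it. The abstract does not specify the pooling operation or the distillation loss, so mean pooling and cosine distance are assumptions here, and all function names and dimensions are hypothetical.

```python
import math

def mean_pool(frames):
    """Average a list of equal-length frame vectors over time (assumed pooling)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def concat_teacher_target(frame_level, utterance_level):
    """Concatenate pooled frame-level and utterance-level teacher embeddings."""
    return mean_pool(frame_level) + list(utterance_level)

def cosine_distance(a, b):
    """Assumed distillation loss: 1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy teacher outputs: two frames of dimension 2, one utterance embedding of dimension 1.
teacher_frames = [[1.0, 2.0], [3.0, 4.0]]
teacher_utterance = [0.5]
target = concat_teacher_target(teacher_frames, teacher_utterance)

# A hypothetical student embedding with the same dimension as the target.
student_embedding = [2.1, 2.9, 0.4]
loss = cosine_distance(student_embedding, target)
```

Under these assumptions the student's output dimension must equal that of the concatenated target, so the choice of teacher embeddings affects the student's size.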