Training speaker-discriminative, robust speaker verification systems without speaker labels remains challenging and worth exploring. Previous studies have noted a substantial performance gap between self-supervised and fully supervised approaches. In this paper, we propose an effective Self-Distillation network with Ensemble Prototypes (SDEP) to facilitate self-supervised speaker representation learning. A range of experiments on the VoxCeleb datasets demonstrates the superiority of the SDEP framework in speaker verification. SDEP achieves a new SOTA on the VoxCeleb1 speaker verification evaluation benchmark (i.e., equal error rates of 1.94\%, 1.99\%, and 3.77\% on the Vox1-O, Vox1-E, and Vox1-H trials, respectively) without using any speaker labels in the training phase. Code will be publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
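The results above are reported as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a minimal sketch of how this metric can be computed from trial scores, and not the evaluation code used for SDEP, the snippet below sweeps a decision threshold over sorted scores. The function name `equal_error_rate` and the toy inputs are hypothetical, and tied scores are not given special treatment.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from similarity scores and 0/1 labels (1 = same speaker)."""
    # Sort trials from the highest to the lowest score.
    trials = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_target = sum(label for _, label in trials)
    n_nontarget = len(trials) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    false_accepts = 0          # non-target trials accepted so far
    false_rejects = n_target   # target trials still rejected
    best_gap, eer = float("inf"), None

    # Accept the top-k trials for k = 0 .. len(trials) and keep the point
    # where the two error rates are closest to each other.
    for k in range(len(trials) + 1):
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
        if k < len(trials):
            if trials[k][1] == 1:
                false_rejects -= 1
            else:
                false_accepts += 1
    return eer


# Toy example (made-up scores, not VoxCeleb data).
print(equal_error_rate([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

On real trial lists with many thousands of pairs, the returned value is a fraction; multiplying by 100 gives the percentage form used in the abstract.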