Training speaker-discriminative and robust speaker verification systems without speaker labels remains challenging and worth exploring. Previous studies have noted a substantial performance gap between self-supervised and fully supervised approaches. In this paper, we propose an effective self-supervised distillation framework with a novel ensemble algorithm, named Ensemble Distillation Network (EDN), to facilitate self-supervised speaker representation learning. Experiments on the VoxCeleb datasets demonstrate the superiority of the EDN framework in speaker verification. Without using any speaker labels during training, EDN achieves a new state of the art on the VoxCeleb1 speaker verification benchmark, with equal error rates of 1.94%, 1.99%, and 3.77% on the Vox1-O, Vox1-E, and Vox1-H trials, respectively. Code will be publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.