Training robust speaker verification systems without speaker labels has long been a challenging task. Previous studies have observed a large performance gap between self-supervised and fully supervised methods. In this paper, we apply a non-contrastive self-supervised learning framework called DIstillation with NO labels (DINO) and propose two regularization terms applied to the embeddings in DINO. One regularization term guarantees the diversity of the embeddings, while the other decorrelates the variables of each embedding. The effectiveness of various data augmentation techniques is explored in both the time and frequency domains. A range of experiments conducted on the VoxCeleb datasets demonstrate the superiority of the regularized DINO framework in speaker verification. Our method achieves state-of-the-art speaker verification performance under a single-stage self-supervised setting on VoxCeleb. Code has been made publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
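To make the two regularization terms concrete, the following is a minimal sketch of one possible form, not the formulation used in this work. It assumes a hinge penalty on the per-dimension standard deviation across a batch for the diversity term, and a penalty on the squared off-diagonal entries of the batch covariance matrix for the decorrelation term. The function names, `target_std`, and `eps` are hypothetical choices made for illustration only.

```python
import math


def diversity_penalty(embeddings, target_std=1.0, eps=1e-4):
    """Assumed diversity term: hinge on each dimension's batch standard deviation.

    The penalty is large when all embeddings in the batch are nearly identical
    and zero once every dimension varies by at least target_std.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    total = 0.0
    for j in range(dim):
        column = [e[j] for e in embeddings]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / (n - 1)
        std = math.sqrt(var + eps)
        total += max(0.0, target_std - std)
    return total / dim


def decorrelation_penalty(embeddings):
    """Assumed decorrelation term: squared off-diagonal covariance, divided by dim.

    The penalty is zero when the embedding variables are pairwise uncorrelated
    across the batch.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    means = [sum(e[j] for e in embeddings) / n for j in range(dim)]
    centered = [[e[j] - means[j] for j in range(dim)] for e in embeddings]
    total = 0.0
    for j in range(dim):
        for k in range(dim):
            if j != k:
                cov = sum(row[j] * row[k] for row in centered) / (n - 1)
                total += cov ** 2
    return total / dim


if __name__ == "__main__":
    collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # no diversity
    correlated = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # redundant variables
    print(diversity_penalty(collapsed), decorrelation_penalty(collapsed))
    print(diversity_penalty(correlated), decorrelation_penalty(correlated))
```

In this sketch, a batch of identical embeddings is penalized only by the diversity term, while a batch whose two variables move together is penalized only by the decorrelation term. In training, such terms would typically be added to the DINO loss with weighting coefficients, which are not given here.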