Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification

Automatic speaker verification has achieved strong results with deep learning approaches trained on large-scale, manually annotated datasets. However, collecting large amounts of well-labeled data for system building is difficult and expensive. In this paper, we propose a novel self-supervised learning framework that can build a high-performance speaker verification system without using any labeled data. To avoid the impact of false negative pairs, we adopt the self-distillation with no labels (DINO) framework as the initial model, which can be trained without negative pairs. We then introduce a cluster-aware training strategy for DINO to improve the diversity of the data. In the iterative learning stage, clustering produces a large number of unreliable labels, so the quality of the pseudo labels is critical for system training. This motivates us to propose dynamic loss-gate and label correction (DLG-LC) methods to alleviate the performance degradation caused by unreliable labels. Specifically, we model the loss distribution with a Gaussian mixture model (GMM) and obtain the loss-gate threshold dynamically to distinguish reliable from unreliable labels. In addition, we use the model predictions to correct the unreliable labels, making better use of the unreliable data rather than discarding it. Moreover, we extend DLG-LC to multiple modalities to further improve performance. Experiments are performed on the commonly used VoxCeleb dataset. Compared with the best-known self-supervised speaker verification system, our proposed method obtains 22.17%, 27.94% and 25.56% relative equal error rate (EER) improvements on the Vox-O, Vox-E and Vox-H test sets, respectively, even with fewer iterations, smaller models, and simpler clustering methods. More importantly, the proposed system achieves results comparable to those of the fully supervised system without using any human-labeled data.
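The loss-gate idea described above can be illustrated with a minimal sketch. The abstract only states that the loss distribution is modeled with a GMM and that the threshold is obtained dynamically, so the details here are assumptions made for illustration: a two-component 1-D mixture fitted with plain EM, a threshold placed where the two weighted component densities are equal, and a hypothetical `min_conf` cutoff for accepting a corrected label. The function names (`fit_two_gaussians`, `dynamic_threshold`, `gate_and_correct`) are likewise illustrative and not taken from the paper.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(losses, iters=50):
    """Fit a two-component 1-D GMM to per-sample losses with plain EM."""
    lo, hi = min(losses), max(losses)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    variances = [max((hi - lo) ** 2 / 16.0, 1e-6)] * 2
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each loss value.
        resp = []
        for x in losses:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            s = (p[0] + p[1]) or 1e-300
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, 1e-6)
    return weights, means, variances


def dynamic_threshold(weights, means, variances):
    """Assumed rule: the loss between the two means where both weighted
    component densities are equal, located by bisection."""
    low, high = sorted(range(2), key=lambda k: means[k])

    def diff(x):
        return (weights[low] * normal_pdf(x, means[low], variances[low])
                - weights[high] * normal_pdf(x, means[high], variances[high]))

    a, b = means[low], means[high]
    if diff(a) <= 0 or diff(b) >= 0:
        return 0.5 * (a + b)  # components overlap too much; fall back to midpoint
    for _ in range(60):
        mid = 0.5 * (a + b)
        if diff(mid) > 0:
            a = mid
        else:
            b = mid
    return 0.5 * (a + b)


def gate_and_correct(losses, pseudo_labels, predictions, threshold, min_conf=0.5):
    """Keep pseudo labels whose loss passes the gate; otherwise use the model's
    top prediction if it is confident enough (min_conf is a hypothetical
    parameter), else return None to mark the sample as unused."""
    corrected = []
    for loss, label, probs in zip(losses, pseudo_labels, predictions):
        if loss <= threshold:
            corrected.append(label)
            continue
        best = max(range(len(probs)), key=lambda c: probs[c])
        corrected.append(best if probs[best] >= min_conf else None)
    return corrected


if __name__ == "__main__":
    random.seed(0)
    # Synthetic losses: a low-loss (reliable) group and a high-loss (unreliable) group.
    demo = [random.gauss(0.5, 0.2) for _ in range(300)]
    demo += [random.gauss(4.0, 0.6) for _ in range(100)]
    w, m, v = fit_two_gaussians(demo)
    print("threshold:", dynamic_threshold(w, m, v))
```

Because the threshold is recomputed from the current loss distribution, it can move as training progresses instead of being fixed by hand. In a real system the mixture would typically be fitted with an existing library and the losses would come from the network's per-sample classification loss; both are left out here to keep the sketch dependency-free.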
