Current speaker recognition systems rely primarily on supervised approaches and are therefore constrained by the scale of labeled datasets. To boost system performance, researchers leverage large pretrained models such as WavLM to transfer learned high-level features to the downstream speaker recognition task. However, this approach introduces extra parameters, because the pretrained model must be retained at inference time. Another group of researchers applies self-supervised methods such as DINO directly to speaker embedding learning, but has not explored their potential on large-scale in-the-wild datasets. In this paper, we demonstrate the effectiveness of DINO training on the large-scale WenetSpeech dataset and its transferability in improving supervised system performance on the CNCeleb dataset. Additionally, we introduce a confidence-based data filtering algorithm that removes unreliable data from the pretraining dataset, leading to better performance with less training data. The associated pretrained models, confidence files, and pretraining and finetuning scripts will be made available in the Wespeaker toolkit.
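The abstract does not say how the confidence scores are computed or which filtering rule is applied, so the following is only a minimal sketch of what confidence-based filtering could look like. It assumes that each pretraining utterance already has a scalar confidence score (for instance, read from a confidence file) and that the most confident fraction of utterances is kept. The function name `filter_by_confidence`, the `keep_ratio` parameter, and the utterance IDs are hypothetical and are not taken from the paper or from the Wespeaker toolkit.

```python
from typing import Dict, List


def filter_by_confidence(confidences: Dict[str, float],
                         keep_ratio: float = 0.8) -> List[str]:
    """Return the utterance IDs with the highest confidence scores.

    Assumption (not from the paper): a higher score means a more
    reliable utterance, and a fixed fraction of the data is retained.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Rank utterance IDs from most to least confident.
    ranked = sorted(confidences, key=lambda utt: confidences[utt], reverse=True)

    # Keep at least one utterance when the input is non-empty.
    n_keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:n_keep]


# Hypothetical scores for four pretraining utterances.
scores = {"utt_a": 0.91, "utt_b": 0.12, "utt_c": 0.77, "utt_d": 0.45}
kept = filter_by_confidence(scores, keep_ratio=0.5)
print(kept)
```

With these illustrative scores and `keep_ratio=0.5`, the two most confident utterances are retained. A fixed score threshold would be an equally plausible rule; the ratio-based variant is shown only because it makes the amount of retained data explicit.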