Current speaker recognition systems rely primarily on supervised approaches and are therefore constrained by the scale of labeled datasets. To boost system performance, researchers leverage large pretrained models such as WavLM to transfer learned high-level features to the downstream speaker recognition task. However, this approach introduces extra parameters, because the pretrained model must be retained at inference time. Other researchers apply self-supervised methods such as DINO directly to speaker embedding learning, yet they have not explored the potential of these methods on large-scale in-the-wild datasets. In this paper, we demonstrate the effectiveness of DINO training on the large-scale WenetSpeech dataset and its transferability in enhancing supervised system performance on the CNCeleb dataset. Additionally, we introduce a confidence-based data filtering algorithm that removes unreliable data from the pretraining dataset, leading to better performance with less training data. The associated pretrained models, confidence files, and pretraining and finetuning scripts will be made available in the Wespeaker toolkit.
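As a rough illustration of what confidence-based data filtering can look like, the sketch below keeps only the utterances whose confidence score reaches a threshold. The abstract does not say how confidence is computed or which cutoff is used, so the scalar per-utterance score, the function name `filter_by_confidence`, the threshold value, and the utterance IDs are all hypothetical; this is not the algorithm or file format used in the Wespeaker toolkit.

```python
from typing import Dict, List


def filter_by_confidence(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the utterance IDs whose confidence is at least `threshold`.

    Assumption: each utterance already has one scalar confidence score,
    where a higher value means the utterance is more reliable for pretraining.
    The input order of the utterances is preserved.
    """
    return [utt_id for utt_id, conf in scores.items() if conf >= threshold]


# Hypothetical confidence scores, made up for illustration only.
scores = {"utt_a": 0.92, "utt_b": 0.31, "utt_c": 0.58, "utt_d": 0.07}

kept = filter_by_confidence(scores, threshold=0.5)
print(f"kept {len(kept)} of {len(scores)} utterances: {kept}")
```

A fixed threshold is only one possible design; keeping a fixed fraction of the highest-scoring utterances would be an equally plausible variant under the same assumptions.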