Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition

Current speaker recognition systems rely primarily on supervised approaches and are therefore constrained by the scale of labeled datasets. To improve performance, researchers leverage large pretrained models such as WavLM, transferring the learned high-level features to the downstream speaker recognition task. However, this approach adds parameters at inference time, because the pretrained model must remain part of the inference pipeline. Another group of researchers applies self-supervised methods such as DINO directly to speaker embedding learning, but has not yet explored the potential of these methods on large-scale in-the-wild datasets. In this paper, we demonstrate the effectiveness of DINO training on the large-scale WenetSpeech dataset and its transferability in improving supervised system performance on the CNCeleb dataset. Additionally, we introduce a confidence-based data filtering algorithm that removes unreliable data from the pretraining dataset, leading to better performance with less training data. The associated pretrained models, confidence files, and pretraining and finetuning scripts will be made available in the Wespeaker toolkit.
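As a rough illustration of what confidence-based filtering can look like, the sketch below keeps only utterances whose confidence score is high enough. The abstract does not say how the confidence is computed or how the cutoff is chosen, so everything here is an assumption: the scores are taken as already available in a dictionary, and the function names `filter_by_confidence` and `keep_top_fraction`, the threshold, and the utterance IDs are hypothetical. It is not the authors' algorithm.

```python
from typing import Dict, List


def filter_by_confidence(confidences: Dict[str, float], threshold: float) -> List[str]:
    """Return the IDs of utterances whose score is at least `threshold`, sorted by ID.

    Assumption: higher score means more reliable data. How the score is
    obtained is outside the scope of this sketch.
    """
    return sorted(utt for utt, score in confidences.items() if score >= threshold)


def keep_top_fraction(confidences: Dict[str, float], fraction: float) -> List[str]:
    """Return the IDs of the highest-scoring `fraction` of utterances.

    Ties are broken by utterance ID so the result is deterministic.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, int(len(confidences) * fraction))
    ranked = sorted(confidences.items(), key=lambda kv: (-kv[1], kv[0]))
    return [utt for utt, _ in ranked[:n_keep]]


# Hypothetical scores for four utterances (illustrative values only).
example_scores = {"utt_a": 0.91, "utt_b": 0.12, "utt_c": 0.55, "utt_d": 0.78}
kept_by_threshold = filter_by_confidence(example_scores, threshold=0.5)
kept_by_fraction = keep_top_fraction(example_scores, fraction=0.5)
```

A fixed threshold is easier to reason about when scores are calibrated, while keeping a top fraction gives direct control over how much pretraining data remains. Which of the two, if either, matches the paper's procedure is not stated in the abstract.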



Voiceprint recognition


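Speaker recognition, also called voiceprint recognition (VPR), is a biometric technology that automatically determines a speaker's identity from the speaker-specific information contained in speech, using computers and modern information recognition techniques. The goal of speaker recognition research is to extract features from speech that characterize the speaker and to build effective models and systems for automatic, accurate speaker identification.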
