In this paper, we contribute a novel, large-scale dataset for speaker verification, comprising a noisy set of 38k identities and 1.45M utterances (VoxSnap) and a relatively clean set of 18k identities and 1.02M utterances (VoxSnap-Clean) for training. First, we collect a list of more than 60K users together with their avatars and download their short videos from YouTube. We then devise an automatic pipeline, which is efficient and scalable, to extract the target user's speech segments and videos. To the best of our knowledge, VoxSnap is the largest speaker recognition dataset. Second, we conduct a series of experiments using VoxSnap-Clean together with VoxCeleb2. Our findings show a notable performance improvement, ranging from 15% to 30% across different backbone architectures, when our dataset is added to the training data. The dataset will be released soon.