In this paper, we contribute a novel and extensive dataset for speaker verification, which contains 38k identities/1.45M utterances of noisy data (VoxBlink) and 18k identities/1.02M utterances of relatively clean data (VoxBlink-Clean) for training. First, we collect a list of more than 60K users together with their avatars and download their short videos from YouTube. We then devise an automatic, efficient, and scalable pipeline to extract each target user's speech segments and videos. To the best of our knowledge, the VoxBlink dataset is the largest speaker recognition dataset. Second, we conduct a series of experiments based on VoxBlink-Clean together with VoxCeleb2. Our findings show a notable performance improvement, ranging from 15% to 30%, across different backbone architectures when our dataset is integrated into training. The dataset will be released soon.