In this paper, we contribute a novel, large-scale dataset for speaker verification. It comprises a noisy set of 38K identities and 1.45M utterances (VoxBlink) and a relatively clean set of 18K identities and 1.02M utterances (VoxBlink-Clean) for training. First, we compile a list of more than 60K users together with their avatars and download their short videos from YouTube. We then establish an automatic and scalable pipeline to extract relevant speech and video segments from these videos. To our knowledge, VoxBlink is one of the largest speaker recognition datasets available. Second, we conduct a series of experiments with different backbones trained on a mixture of VoxCeleb2 and VoxBlink-Clean. Our findings show a notable performance improvement, ranging from 13% to 30% across backbone architectures, when our dataset is integrated into training. The dataset will be made publicly available shortly.