Recent advancements in singing voice synthesis and conversion necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, insufficient diversity in deepfake methods, and licensing restrictions. To address these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. The deepfake vocals are synthesized using state-of-the-art methods from publicly accessible singing voice datasets. CtrSVDD comprises 47.64 hours of bonafide and 260.34 hours of deepfake singing vocals, spanning 14 deepfake methods and 164 singer identities. We also present a baseline system with flexible front-end features, evaluated against a structured train/dev/eval split. The experiments show the importance of feature selection and highlight the need for generalization towards deepfake methods that deviate further from the training distribution. The CtrSVDD dataset and baselines are publicly accessible.