Singing voice synthesis and singing voice conversion have advanced significantly, revolutionizing musical experiences. However, the rise of "Deepfake Songs" generated by these technologies raises concerns about authenticity. Unlike Audio DeepFake Detection (ADD), the field of song deepfake detection lacks specialized datasets and methods for verifying song authenticity. In this paper, we first construct a Chinese Fake Song Detection (FSD) dataset to investigate song deepfake detection. The fake songs in the FSD dataset are generated by five state-of-the-art singing voice synthesis and singing voice conversion methods. Our initial experiments on FSD reveal that existing speech-trained ADD models are ineffective for song deepfake detection. We therefore train ADD models on the FSD dataset and evaluate them under two scenarios: one with the original songs and another with separated vocal tracks. Experimental results show that, on the FSD test set, song-trained ADD models reduce the average equal error rate (EER) by approximately 38.58% compared to speech-trained ADD models.
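For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate (fake songs accepted as genuine) equals the false rejection rate (genuine songs rejected as fake). The following is a minimal sketch of how it could be computed from detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely genuine", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for two lists of detector scores.

    Assumption: a higher score means "more likely a genuine song".
    The EER is approximated at the candidate threshold where the false
    acceptance and false rejection rates are closest to each other.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: genuine songs scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: fake songs scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, invented for illustration only (not taken from FSD).
genuine = [0.9, 0.8, 0.7, 0.4]
fake = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, fake)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On these toy scores, one genuine song falls below the chosen threshold and one fake song falls at or above it, so both error rates are 25%. A lower EER indicates a better detector.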