The task of deepfake detection is far from being solved by speech or vision researchers. Several publicly available databases of fake synthetic video and speech have been built to aid the development of detection methods. However, existing databases typically focus on either the visual or the voice modality and provide no proof that their deepfakes can in fact impersonate any real person. In this paper, we present SWAN-DF, the first realistic audio-visual database of deepfakes, in which lips and speech are well synchronized and the videos have high visual and audio quality. We took the publicly available SWAN dataset of real videos with different identities and created audio-visual deepfakes using several models from DeepFaceLab and blending techniques for face swapping, and the HiFiVC, DiffVC, YourTTS, and FreeVC models for voice conversion. From the publicly available speech dataset LibriTTS, we also created a separate database of audio-only deepfakes, LibriTTS-DF, using several recent text-to-speech methods: YourTTS, Adaspeech, and TorToiSe. We demonstrate the vulnerability of a state-of-the-art speaker recognition system, the ECAPA-TDNN-based model from SpeechBrain, to the synthetic voices. Similarly, we tested a face recognition system based on the MobileFaceNet architecture against several variants of our visual deepfakes. The vulnerability assessment shows that by tuning existing pretrained deepfake models to specific identities, one can successfully spoof the face and speaker recognition systems more than 90% of the time and obtain a very realistic looking and sounding fake video of a given person.
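As an illustration of what a spoofing rate such as "more than 90% of the time" can mean in practice, the following minimal sketch computes the fraction of deepfake samples that a recognition system would accept as the target identity. It is not the evaluation protocol of this work: the cosine-similarity scoring, the fixed decision threshold, the toy 3-dimensional embeddings, and the function names `cosine_similarity` and `spoof_acceptance_rate` are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def spoof_acceptance_rate(target_embedding, fake_embeddings, threshold):
    """Fraction of deepfake samples scored at or above the decision threshold.

    Hypothetical helper: the threshold is assumed to have been chosen
    beforehand on genuine and zero-effort impostor scores.
    """
    scores = [cosine_similarity(target_embedding, e) for e in fake_embeddings]
    accepted = sum(1 for s in scores if s >= threshold)
    return accepted / len(scores)


# Toy data (assumed for illustration; real embeddings come from a
# face or speaker recognition model and have far more dimensions).
target = [1.0, 0.0, 0.0]
fakes = [
    [0.9, 0.1, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.05, 0.0],
]

rate = spoof_acceptance_rate(target, fakes, threshold=0.7)
print(f"Deepfakes accepted as the target identity: {rate:.0%}")
```

In this toy setting three of the four fake embeddings score above the threshold, so the system would be spoofed in 75% of attempts. A real assessment would replace the toy vectors with embeddings extracted from the enrolled identity and from the deepfake samples.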