With the rapid advancement of generative AI, multimodal deepfakes, which manipulate both audio and visual modalities, have drawn increasing public concern. Deepfake detection has consequently emerged as a crucial strategy for countering these growing threats. Datasets are a key factor in training and validating deepfake detectors, yet most existing deepfake datasets focus primarily on the visual modality. The few that are multimodal employ outdated techniques and limit their audio content to a single language, so they fail to represent the cutting-edge advancements and globalization trends in current deepfake technologies. To address this gap, we propose a novel multilingual and multimodal deepfake dataset: PolyGlotFake. It includes content in seven languages, created using a variety of cutting-edge and popular text-to-speech, voice cloning, and lip-sync technologies. We conduct comprehensive experiments on the PolyGlotFake dataset using state-of-the-art detection methods. These experiments demonstrate the dataset's significant challenges and its practical value in advancing research into multimodal deepfake detection.