Identity-Driven Multimedia Forgery Detection via Reference Assistance

Recent advances in technologies such as deepfake generation have made it possible to produce a wide variety of media forgeries. In response to the potential hazards of these forgeries, many researchers are exploring detection methods, which in turn increases the demand for high-quality media forgery datasets. However, existing datasets have several limitations. First, most datasets focus on manipulation of the visual modality and lack diversity, as they consider only a few forgery approaches. Second, the media are often inadequate in clarity and naturalness, and the datasets themselves are limited in size. Third, although many real-world forgeries are identity-driven, the identity information of the subject in the media is frequently neglected. For detection, identity information could be an essential clue for improving accuracy. Moreover, official media about a given identity on the Internet can serve as prior knowledge, helping both audiences and forgery detectors determine the true identity. We therefore propose an identity-driven multimedia forgery dataset, IDForge, which contains 249,138 video shots. All video shots are sourced from 324 in-the-wild videos of 54 celebrities collected from the Internet. The fake video shots involve 9 types of manipulation across the visual, audio, and textual modalities. Additionally, IDForge provides an extra 214,438 real video shots as a reference set for the 54 celebrities. Correspondingly, we design an effective multimedia detection network, the Reference-assisted Multimodal Forgery Detection Network (R-MFDN). Extensive experiments on the proposed dataset demonstrate the effectiveness of R-MFDN on the multimedia detection task.
