LookupForensics: A Large-Scale Multi-Task Dataset for Multi-Phase Image-Based Fact Verification

Amid the proliferation of forged images, notably the surge of deepfake content, extensive research has addressed the use of artificial intelligence (AI) to identify forged content as counterfeiting technologies continue to advance. We investigate the use of AI to provide the original authentic image once a deepfake has been detected, an approach we believe is reliable and persuasive. We call this task "image-based automated fact verification," a name derived from a text-based fact-checking system used by journalists. We have developed a two-phase open framework that integrates detection and retrieval components. In addition, inspired by a dataset proposed by Meta Fundamental AI Research, we have constructed a large-scale dataset designed specifically for this task. The dataset simulates real-world conditions and includes both content-preserving and content-aware manipulations, which span a range of difficulty levels and leave room for ongoing research. The multi-task dataset is fully annotated, so it can be used for sub-tasks within the forgery identification and fact retrieval domains. This paper makes two main contributions: (1) We introduce a new task, "image-based automated fact verification," and present a novel two-phase open framework combining "forgery identification" and "fact retrieval." (2) We present a large-scale dataset tailored to this new task, featuring a variety of hand-crafted image edits and machine learning-driven manipulations, with extensive annotations suitable for various sub-tasks. Extensive experimental results validate the dataset's practicality for fact verification research and clarify the difficulty of its sub-tasks.
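To make the two-phase idea concrete, the following is a minimal sketch of how a detection step could gate a retrieval step. It is not the authors' framework: the names `verify_image`, `Verification`, and `reference_index`, the 0.5 threshold, the representation of images as plain embedding vectors, and the use of cosine similarity for retrieval are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence

# Assumption: every image is already represented by an embedding vector.
Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class Verification:
    is_forged: bool
    forgery_score: float
    source_id: Optional[str] = None      # id of the retrieved authentic image
    similarity: Optional[float] = None   # similarity to that image


def verify_image(
    query: Vector,
    detector: Callable[[Vector], float],   # hypothetical: returns a forgery score in [0, 1]
    reference_index: Dict[str, Vector],    # hypothetical: id -> embedding of authentic images
    threshold: float = 0.5,                # assumed decision threshold
) -> Verification:
    # Phase 1 (forgery identification): score the query image.
    score = detector(query)
    if score < threshold:
        return Verification(is_forged=False, forgery_score=score)

    # Phase 2 (fact retrieval): find the most similar authentic image.
    if not reference_index:
        return Verification(is_forged=True, forgery_score=score)
    best_id, best_sim = max(
        ((ref_id, cosine_similarity(query, emb)) for ref_id, emb in reference_index.items()),
        key=lambda pair: pair[1],
    )
    return Verification(True, score, best_id, best_sim)
```

In this sketch, a call such as `verify_image(query_embedding, my_detector, reference_index)` returns the forgery decision together with the identifier of the closest authentic image, so the verdict can be accompanied by the retrieved original as evidence.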
