Detecting and Grounding Multi-Modal Media Manipulation and Beyond

Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. Although various deepfake detection and text fake news detection methods have been proposed, they are designed only for single-modality forgery and rely on binary classification; they do not analyze or reason about subtle forgery traces across modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM^4). DGM^4 aims not only to detect the authenticity of multi-modal media but also to ground the manipulated content, which requires deeper reasoning about multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM^4 dataset, in which image-text pairs are manipulated by various approaches and richly annotated with the diverse manipulations applied. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by a multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. To exploit more fine-grained contrastive learning for cross-modal semantic alignment, we further integrate a Manipulation-Aware Contrastive Loss with Local View and construct a more advanced model, HAMMER++. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of HAMMER and HAMMER++.
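
To make the idea of manipulation-aware contrastive learning between two uni-modal encoders more concrete, the following is a minimal sketch under stated assumptions, not the formulation used by HAMMER. It assumes an InfoNCE-style image-to-text loss in which only pristine image-text pairs act as positives, while every other text in the batch, including manipulated ones, acts as a negative. The function name, the `is_pristine` flag, and the temperature value are hypothetical and do not come from the abstract.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def manipulation_aware_contrastive_loss(image_embs, text_embs, is_pristine,
                                        temperature=0.07):
    """Image-to-text InfoNCE loss averaged over pristine pairs only.

    Assumption (not from the source): pair i is a positive only when
    is_pristine[i] is True; manipulated pairs are skipped as anchors but
    their texts still appear as negatives for the other anchors.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    losses = []
    for i, img in enumerate(imgs):
        if not is_pristine[i]:
            continue  # a manipulated pair is never pulled together
        logits = [dot(img, t) / temperature for t in txts]
        # Numerically stable log-sum-exp over all texts in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    if not losses:
        return 0.0  # no pristine pair in the batch, nothing to align
    return sum(losses) / len(losses)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print(manipulation_aware_contrastive_loss(images, texts, [True, True]))
```

In this sketch, well-aligned pristine pairs yield a loss near zero, and a batch with no pristine pairs contributes nothing. A local-view variant such as the one HAMMER++ alludes to would presumably compare finer-grained features (for example, patch or token embeddings), but the abstract gives no details, so that is left out here.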
