Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation

Detecting and grounding multi-modal media manipulation (DGM^4) has become increasingly crucial due to the widespread dissemination of face forgery and text misinformation. In this paper, we present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM^4 problem. Unlike previous state-of-the-art methods, which focus solely on the image (RGB) domain to describe visual forgery features, we additionally introduce the frequency domain as a complementary viewpoint. Using the discrete wavelet transform, we decompose images into several frequency sub-bands that capture rich face forgery artifacts. Our proposed frequency encoder, which incorporates intra-band and inter-band self-attention, then explicitly aggregates forgery features within and across the sub-bands. Moreover, to address the semantic conflicts between the image and frequency domains, we develop a forgery-aware mutual module that enables effective interaction between the disparate image and frequency features, resulting in aligned and comprehensive visual forgery representations. Finally, based on the visual and textual forgery features, we propose a unified decoder comprising two symmetric cross-modal interaction modules, which gather modality-specific forgery information, and a fusing interaction module, which aggregates both modalities. This decoder makes UFAFormer a unified framework, simplifying the overall architecture and facilitating optimization. Experimental results on the DGM^4 dataset, containing several perturbations, demonstrate the superior performance of our framework compared to previous methods, setting a new benchmark in the field.
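To illustrate the sub-band decomposition mentioned above, the following is a minimal sketch of a single-level 2D discrete wavelet transform. It assumes the Haar wavelet, a grayscale image with even height and width, and one decomposition level; the abstract does not state which wavelet or how many levels UFAFormer uses, so these choices, along with the function names haar_dwt2 and haar_idwt2 and the sub-band labels, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def haar_dwt2(image):
    """Single-level 2D Haar DWT (orthonormal scaling).

    Returns four sub-bands, each half the input size per axis:
    ll (approximation), lh (row differences), hl (column differences),
    hh (diagonal differences). Sub-band naming conventions vary.
    """
    x = np.asarray(image, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D array with even height and width")

    # The four pixels of every non-overlapping 2x2 block.
    a = x[0::2, 0::2]  # top-left
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right

    ll = (a + b + c + d) / 2.0  # low-pass along both axes
    lh = (a + b - c - d) / 2.0  # top row minus bottom row
    hl = (a - b + c - d) / 2.0  # left column minus right column
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the image from its four sub-bands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=np.float64)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((8, 8))
    bands = haar_dwt2(img)
    print([s.shape for s in bands])
    print(np.allclose(haar_idwt2(*bands), img))
```

Because this transform is orthonormal, the four sub-bands together preserve the energy of the input and allow exact reconstruction. A smooth region yields near-zero detail sub-bands, so local inconsistencies such as blending boundaries tend to stand out in lh, hl, and hh; this is one plausible reason the frequency view can complement RGB features.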
