AI-synthesized text and images have drawn significant attention, particularly because multi-modal manipulations now spread widely on the internet and cause numerous negative impacts on society. Existing methods for multi-modal manipulation detection and grounding primarily fuse vision-language features to make predictions while overlooking modality-specific features, which leads to sub-optimal results. In this paper, we construct a simple yet novel transformer-based framework for multi-modal manipulation detection and grounding. Our framework explores modality-specific features while preserving the capability for multi-modal alignment. To achieve this, we introduce visual and language pre-trained encoders together with dual-branch cross-attention (DCA) to extract and fuse modality-unique features. Furthermore, we design decoupled fine-grained classifiers (DFC) to enhance modality-specific feature mining and mitigate modality competition. We also propose an implicit manipulation query (IMQ) that uses learnable queries to adaptively aggregate global contextual cues within each modality, thereby improving the discovery of forged details. Extensive experiments on the $\mathrm{DGM}^4$ dataset show that our model outperforms state-of-the-art approaches.
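As a rough illustration of the two attention ideas named in the abstract, the following minimal sketch shows (a) a dual-branch cross-attention step in which image tokens attend to text tokens and vice versa, and (b) a small set of query vectors that aggregate global context from one modality's tokens. It is not the paper's implementation: the abstract gives no layer details, so the sketch assumes single-head scaled dot-product attention with no learned projections, residual connections, or normalization. All function names (`cross_attention`, `dual_branch_cross_attention`, `aggregate_with_queries`) are hypothetical, and the "learnable" queries are simply random vectors here because no training loop is included.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention (assumption: no
    learned projections). Each query attends over `keys_values`, which
    serve as both keys and values. Returns one vector per query."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of value vectors, one coordinate at a time.
        outputs.append([sum(w * v[i] for w, v in zip(weights, keys_values))
                        for i in range(d)])
    return outputs


def dual_branch_cross_attention(img_tokens, txt_tokens):
    """Hypothetical two-branch step: one branch per modality, each
    querying the other modality's tokens."""
    img_out = cross_attention(img_tokens, txt_tokens)  # image attends to text
    txt_out = cross_attention(txt_tokens, img_tokens)  # text attends to image
    return img_out, txt_out


def aggregate_with_queries(query_vectors, tokens):
    """Hypothetical query-based aggregation: each query vector pools
    global context from all tokens of a single modality."""
    return cross_attention(query_vectors, tokens)


def random_tokens(n, d):
    """Helper for the demo: n random d-dimensional token vectors."""
    return [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]


if __name__ == "__main__":
    random.seed(0)
    img = random_tokens(3, 4)      # 3 image tokens, dimension 4
    txt = random_tokens(5, 4)      # 5 text tokens, dimension 4
    queries = random_tokens(2, 4)  # 2 stand-in "learnable" queries

    img_out, txt_out = dual_branch_cross_attention(img, txt)
    pooled = aggregate_with_queries(queries, img)
    print(len(img_out), len(txt_out), len(pooled))
```

Keeping a separate output per modality, rather than a single fused vector, is what would let modality-specific classifiers operate on their own features; how the paper actually wires these outputs into its classifiers is not stated in the abstract.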