Latent Reconstruction from Generated Data for Multimodal Misinformation Detection

Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image's origin, context, or meaning, poses a growing challenge in the digital age. Due to the scarcity of large-scale annotated datasets for multimodal misinformation detection (MMD), recent approaches rely on synthetic training data created via out-of-context pairings or named entity manipulations (e.g., altering names, dates, or locations). However, these techniques often yield simplistic, unrealistic examples, which limits their utility for training. To address this, we introduce "MisCaption This!", a framework for generating high-fidelity synthetic miscaptioned datasets through Adversarial Prompting of Vision-Language Models (VLMs). Additionally, we introduce "Latent Multimodal Reconstruction" (LAMAR), a Transformer-based network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to guide detection. We explore various training strategies (end-to-end vs. large-scale pre-training) and integration mechanisms (direct, mask, gate, and attention). Extensive experiments show that models trained on "MisCaption This!" data generalize better to real-world misinformation, while LAMAR sets a new state of the art on NewsCLIPpings, VERITE, and the newly introduced VERITE 24/25 benchmark, highlighting the efficacy of VLM-generated data and reconstruction-based networks for advancing MMD. Our code is available at https://github.com/stevejpapad/miscaptioned-image-reconstruction
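To make the baseline concrete, the out-of-context pairing strategy mentioned above can be sketched in a few lines. This is only an illustrative sketch under simple assumptions, not the procedure used by any specific dataset or by this work: the function name `make_out_of_context_pairs`, the image identifiers, the captions, and the label convention (0 for truthful, 1 for out-of-context) are all hypothetical, and the mismatched caption is drawn uniformly at random rather than by any similarity-based retrieval.

```python
import random


def make_out_of_context_pairs(pairs, seed=0):
    """Build a toy training set from truthful (image_id, caption) pairs.

    Hypothetical helper for illustration only. Each truthful pair is kept
    with label 0, and each image is additionally paired with the caption of
    a different, randomly chosen pair with label 1 (out-of-context).
    Assumes captions are distinct across pairs.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two truthful pairs to create a mismatch")

    rng = random.Random(seed)  # local generator, so results are reproducible
    samples = []
    for i, (image_id, caption) in enumerate(pairs):
        samples.append((image_id, caption, 0))

        # Draw an index uniformly from all positions except i.
        j = rng.randrange(len(pairs) - 1)
        if j >= i:
            j += 1
        samples.append((image_id, pairs[j][1], 1))
    return samples


# Made-up example data; identifiers and captions are placeholders.
truthful = [
    ("img_001", "Flood waters cover a street in the city centre."),
    ("img_002", "Protesters gather outside the parliament building."),
    ("img_003", "A wildfire burns on a hillside at night."),
]
dataset = make_out_of_context_pairs(truthful)
for image_id, caption, label in dataset:
    print(label, image_id, caption)
```

Because the mismatched caption is unrelated to the image, such negatives tend to be easy to spot, which is the limitation the abstract attributes to this family of synthetic data.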
