VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias

Multimedia content has become ubiquitous on social media platforms, leading to the rise of multimodal misinformation (MM) and the urgent need for effective strategies to detect and prevent its spread. In recent years, the challenge of multimodal misinformation detection (MMD) has garnered significant attention from researchers and has mainly involved the creation of annotated, weakly annotated, or synthetically generated training datasets, along with the development of various deep learning MMD models. However, the problem of unimodal bias has been overlooked: specific patterns and biases in MMD benchmarks can result in biased or unimodal models outperforming their multimodal counterparts on an inherently multimodal task, making it difficult to assess progress. In this study, we systematically investigate and identify the presence of unimodal bias in widely used MMD benchmarks, namely VMU-Twitter and COSMOS. To address this issue, we introduce the "VERification of Image-TExt pairs" (VERITE) benchmark for MMD, which incorporates real-world data, excludes "asymmetric multimodal misinformation" and utilizes "modality balancing". We conduct an extensive comparative study with a Transformer-based architecture that shows the ability of VERITE to effectively address unimodal bias, rendering it a robust evaluation framework for MMD. Furthermore, we introduce a new method, termed Crossmodal HArd Synthetic MisAlignment (CHASMA), for generating realistic synthetic training data that preserve crossmodal relations between legitimate images and false human-written captions. By leveraging CHASMA in the training process, we observe consistent and notable improvements in predictive performance on VERITE, with a 9.2% increase in accuracy. We release our code at: https://github.com/stevejpapad/image-text-verification
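
To illustrate why unimodal bias undermines a benchmark, the following is a minimal sketch and not the procedure used in this work. It assumes a toy representation in which each sample is an `(image_id, caption_id, label)` triple, and it reads "modality balancing" as every image and every caption appearing equally often in both classes; that reading, the function name `unimodal_ceiling`, and the toy data are all assumptions made for illustration. The function computes the best accuracy any classifier could reach when it sees only one modality.

```python
from collections import Counter, defaultdict


def unimodal_ceiling(pairs, modality):
    """Upper bound on accuracy for a classifier that sees a single modality.

    pairs: list of (image_id, caption_id, label) triples (hypothetical format).
    modality: "image" or "text".
    """
    index = 0 if modality == "image" else 1
    label_counts = defaultdict(Counter)
    for sample in pairs:
        # Group labels by the only item the unimodal classifier can observe.
        label_counts[sample[index]][sample[2]] += 1
    # The best unimodal strategy predicts the majority label for each item.
    correct = sum(max(counts.values()) for counts in label_counts.values())
    return correct / len(pairs)


# Toy benchmark with a text-side bias: caption c1 is always "true",
# caption c2 is always "false", regardless of the paired image.
biased = [
    ("i1", "c1", "true"), ("i1", "c2", "false"),
    ("i2", "c1", "true"), ("i2", "c2", "false"),
]

# Toy modality-balanced benchmark: every image and every caption
# appears once in a truthful pair and once in a misleading pair.
balanced = [
    ("i1", "c1", "true"), ("i1", "c2", "false"),
    ("i2", "c2", "true"), ("i2", "c1", "false"),
]

for name, data in [("biased", biased), ("balanced", balanced)]:
    print(name,
          "image-only:", unimodal_ceiling(data, "image"),
          "text-only:", unimodal_ceiling(data, "text"))
```

On the biased toy set a text-only classifier can label every sample correctly without looking at the image, whereas on the balanced toy set neither modality alone can do better than chance, so only a model that relates image and caption can score well.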
