VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias

Multimedia content has become ubiquitous on social media platforms, leading to the rise of multimodal misinformation (MM) and the urgent need for effective strategies to detect and prevent its spread. In recent years, the challenge of multimodal misinformation detection (MMD) has garnered significant attention from researchers and has mainly involved the creation of annotated, weakly annotated, or synthetically generated training datasets, along with the development of various deep learning MMD models. However, the problem of unimodal bias in MMD benchmarks -- where biased or unimodal methods outperform their multimodal counterparts on an inherently multimodal task -- has been overlooked. In this study, we systematically investigate and identify the presence of unimodal bias in widely used MMD benchmarks (VMU-Twitter, COSMOS), raising concerns about their suitability for reliable evaluation. To address this issue, we introduce the "VERification of Image-TExt pairs" (VERITE) benchmark for MMD, which incorporates real-world data, excludes "asymmetric multimodal misinformation", and utilizes "modality balancing". We conduct an extensive comparative study with a Transformer-based architecture that shows the ability of VERITE to effectively address unimodal bias, rendering it a robust evaluation framework for MMD. Furthermore, we introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data that preserve crossmodal relations between legitimate images and false human-written captions. By leveraging CHASMA in the training process, we observe consistent and notable improvements in predictive performance on VERITE, with a 9.2% increase in accuracy. We release our code at: https://github.com/stevejpapad/image-text-verification
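
As an illustration of the notion of unimodal bias described above, the following is a minimal sketch of how one might check whether image-only or text-only predictions outperform multimodal predictions on the same benchmark labels. It is not the authors' evaluation code: the function names (`accuracy`, `unimodal_bias_report`), the dictionary keys, and the toy labels and predictions are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and equally long")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def unimodal_bias_report(
    labels: Sequence[int],
    multimodal_preds: Sequence[int],
    unimodal_preds: Dict[str, Sequence[int]],
) -> Dict[str, float]:
    """Compare unimodal baselines against a multimodal model (hypothetical helper).

    A positive "max_unimodal_gap" means at least one unimodal baseline beats
    the multimodal model, which is the symptom the abstract calls unimodal bias.
    """
    if not unimodal_preds:
        raise ValueError("at least one unimodal baseline is required")
    report = {"multimodal": accuracy(multimodal_preds, labels)}
    for name, preds in unimodal_preds.items():
        report[name] = accuracy(preds, labels)
    best_unimodal = max(report[name] for name in unimodal_preds)
    report["max_unimodal_gap"] = best_unimodal - report["multimodal"]
    return report


# Toy data, made up for illustration only (1 = misinformation, 0 = truthful).
toy_labels = [1, 0, 1, 0]
toy_report = unimodal_bias_report(
    labels=toy_labels,
    multimodal_preds=[1, 0, 0, 0],
    unimodal_preds={
        "text_only": [1, 0, 1, 0],
        "image_only": [1, 1, 0, 0],
    },
)
print(toy_report)
```

On a benchmark free of unimodal bias, one would expect the gap to be zero or negative; a clearly positive gap on real predictions would suggest that the benchmark can be solved from a single modality.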
