Multimedia content has become ubiquitous on social media platforms, leading to the rise of multimodal misinformation and an urgent need for effective strategies to detect and prevent its spread. This study focuses on Crossmodal Misinformation (CMM), where image-caption pairs work together to spread falsehoods. We contrast CMM with Asymmetric Multimodal Misinformation (AMM), where one dominant modality propagates falsehoods while the other modalities have little or no influence. We show that AMM adds noise to the training and evaluation process and exacerbates unimodal bias, where text-only or image-only detectors can seemingly outperform their multimodal counterparts on an inherently multimodal task. To address this issue, we collect and curate FIGMENTS, a robust evaluation benchmark for CMM that consists of real-world cases of misinformation, excludes AMM, and uses modality balancing to successfully alleviate unimodal bias. FIGMENTS also provides a first step towards fine-grained CMM detection by including three classes: truthful, out-of-context, and miscaptioned image-caption pairs. Furthermore, we introduce Crossmodal HArd Synthetic MisAlignment (CHASMA), a method for generating realistic synthetic training data that maintains crossmodal relations between legitimate images and false human-written captions. We conduct an extensive comparative study using a Transformer-based architecture. Our results show that incorporating CHASMA in conjunction with other generated datasets consistently improves overall performance on FIGMENTS in both the binary (+6.26%) and multiclass (+15.8%) settings. We release our code at: https://github.com/stevejpapad/figments-and-misalignments