Synthetic Misinformers: Generating and Combating Multimodal Misinformation

With the expansion of social media and the increasing dissemination of multimedia content, the spread of misinformation has become a major concern. This necessitates effective strategies for multimodal misinformation detection (MMD), which determines whether the combination of an image and its accompanying text could mislead or misinform. Due to the data-intensive nature of deep neural networks and the labor-intensive process of manual annotation, researchers have been exploring various methods for automatically generating synthetic multimodal misinformation, which we refer to as Synthetic Misinformers, in order to train MMD models. However, limited evaluation on real-world misinformation and a lack of comparisons with other Synthetic Misinformers make it difficult to assess progress in the field. To address this, we perform a comparative study of existing and new Synthetic Misinformers that involve (1) out-of-context (OOC) image-caption pairs, (2) cross-modal named entity inconsistency (NEI), and (3) hybrid approaches, and we evaluate them against real-world misinformation using the COSMOS benchmark. The comparative study showed that our proposed CLIP-based Named Entity Swapping can lead to MMD models that surpass other OOC and NEI Misinformers in terms of multimodal accuracy, and that hybrid approaches can lead to even higher detection accuracy. Nevertheless, after alleviating information leakage from the COSMOS evaluation protocol, low Sensitivity scores indicate that the task is significantly more challenging than previous studies suggested. Finally, our findings showed that NEI-based Synthetic Misinformers tend to suffer from a unimodal bias, where text-only MMD models can outperform multimodal ones.
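
To illustrate why Sensitivity is reported alongside accuracy, the following is a minimal sketch that assumes binary labels where 1 marks misinformation (the positive class) and 0 marks a truthful image-caption pair. The function name and the toy labels are hypothetical and are not taken from the study or from the COSMOS benchmark; they only show how a detector can reach a moderate accuracy while missing most misinformation.

```python
import math


def sensitivity_and_accuracy(y_true, y_pred):
    """Return (sensitivity, accuracy) for binary labels.

    Assumed label convention (not from the paper):
    1 = misinformation (positive class), 0 = truthful pair.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)

    # Sensitivity (true positive rate): share of misinformation that is caught.
    positives = true_pos + false_neg
    sensitivity = true_pos / positives if positives else math.nan
    # Accuracy: share of all samples classified correctly.
    accuracy = correct / len(pairs) if pairs else math.nan
    return sensitivity, accuracy


# Hypothetical toy data: 4 misinformation samples, 6 truthful samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A detector that flags only one of the four misinformation samples.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

sens, acc = sensitivity_and_accuracy(y_true, y_pred)
print(f"sensitivity={sens:.2f}, accuracy={acc:.2f}")
```

In this toy case the detector is correct on 7 of 10 samples, yet it catches only 1 of the 4 misinformation samples, which is the kind of gap that a low Sensitivity score exposes.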
