Online disinformation poses an escalating threat to society, driven increasingly by the rapid spread of misleading content across multimodal and multilingual platforms. While automated fact-checking methods have advanced in recent years, their effectiveness remains constrained by the scarcity of datasets that reflect these real-world complexities. To address this gap, we first present MultiCaption, a new dataset specifically designed for detecting contradictions between visual claims. Pairs of claims referring to the same image or video were labeled through multiple annotation strategies to determine whether they contradict each other. The resulting dataset comprises 11,088 visual claims in 64 languages, offering a unique resource for building and evaluating misinformation-detection systems in truly multimodal and multilingual environments. We then provide comprehensive experiments with transformer-based architectures, natural language inference models, and large language models, establishing strong baselines for future research. The results show that MultiCaption is more challenging than standard NLI tasks, requiring task-specific fine-tuning for strong performance. Moreover, the gains from multilingual training and testing highlight the dataset's potential for building effective multilingual fact-checking pipelines without relying on machine translation.
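To make the task formulation concrete, the sketch below shows how the pairwise contradiction setting described above can be cast as natural language inference: two captions referring to the same image or video are fed as a premise-hypothesis pair to an off-the-shelf multilingual NLI model. This is a minimal illustration, not the paper's baseline; the checkpoint name (MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7) and the example captions are assumptions chosen for demonstration.

```python
# Minimal sketch: scoring a pair of visual claims for contradiction with a
# multilingual NLI model. The checkpoint below is an illustrative choice,
# not necessarily the one used in the paper's experiments.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def contradiction_score(claim_a: str, claim_b: str) -> float:
    """Return the NLI contradiction probability for two claims that
    refer to the same image or video."""
    inputs = tokenizer(claim_a, claim_b, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    # Label order differs across checkpoints, so read it from the config;
    # this assumes the checkpoint exposes a "contradiction" label.
    label2id = {label.lower(): i for i, label in model.config.id2label.items()}
    return probs[label2id["contradiction"]].item()

# Hypothetical example: captions for the same video in two languages.
print(contradiction_score(
    "The video shows flooding in Jakarta in 2020.",
    "Das Video zeigt eine Dürre in Jakarta im Jahr 2020.",
))
```

Because both claims describe the same visual content, no image features are needed for this formulation; the cross-lingual pairing in the example also illustrates why multilingual encoders, rather than machine translation, are the natural fit for this dataset.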