We propose end-to-end multimodal fact-checking and explanation generation. The input is a claim and a large collection of web sources, including articles, images, videos, and tweets. The goal is to assess the truthfulness of the claim by retrieving relevant evidence and predicting a truthfulness label (e.g., support, refute, or not enough information), and to generate a statement that summarizes and explains the reasoning and ruling process. To support this research, we construct Mocheg, a large-scale dataset of 15,601 claims, each annotated with a truthfulness label and a ruling statement, together with 33,880 textual paragraphs and 12,112 images as evidence. To establish baseline performance on Mocheg, we experiment with several state-of-the-art neural architectures on three pipelined subtasks: multimodal evidence retrieval, claim verification, and explanation generation. The results demonstrate that state-of-the-art end-to-end multimodal fact-checking does not achieve satisfactory performance. To the best of our knowledge, we are the first to build a benchmark dataset and solutions for end-to-end multimodal fact-checking and explanation generation. The dataset, source code, and model checkpoints are available at https://github.com/VT-NLP/Mocheg.
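The three pipelined subtasks can be pictured as three functions chained together. The following is only a minimal sketch under stated assumptions, not the Mocheg implementation: every name here (`Evidence`, `retrieve`, `verify`, `explain`, `fact_check`) is hypothetical, token overlap stands in for a neural retriever, a keyword rule stands in for a trained claim verifier, and a template stands in for a text generator. Images are represented by caption strings purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Label set taken from the abstract's example labels.
LABELS = ("support", "refute", "not enough information")


@dataclass
class Evidence:
    source_id: str   # hypothetical identifier for a paragraph or image
    modality: str    # "text" or "image"
    content: str     # paragraph text, or a caption standing in for an image


def tokenize(text: str) -> set:
    """Lowercase whitespace tokenization (toy stand-in for an encoder)."""
    return set(text.lower().split())


def retrieve(claim: str, corpus: List[Evidence], k: int = 2) -> List[Evidence]:
    """Subtask 1: rank evidence by token overlap with the claim (assumption)."""
    claim_tokens = tokenize(claim)
    ranked = sorted(
        corpus,
        key=lambda e: len(claim_tokens & tokenize(e.content)),
        reverse=True,
    )
    return ranked[:k]


def verify(claim: str, evidence: List[Evidence]) -> str:
    """Subtask 2: toy rule-based verifier, a placeholder for a trained model."""
    claim_tokens = tokenize(claim)
    overlapping = [e for e in evidence if claim_tokens & tokenize(e.content)]
    if not overlapping:
        return "not enough information"
    negation_cues = {"not", "no", "false", "never"}
    if any(negation_cues & tokenize(e.content) for e in overlapping):
        return "refute"
    return "support"


def explain(claim: str, evidence: List[Evidence], label: str) -> str:
    """Subtask 3: template-based explanation, a placeholder for a generator."""
    sources = ", ".join(e.source_id for e in evidence) or "none"
    return f"Claim '{claim}' is labeled '{label}' based on evidence: {sources}."


def fact_check(claim: str, corpus: List[Evidence], k: int = 2) -> Tuple[str, List[Evidence], str]:
    """Chain the three subtasks: retrieval, then verification, then explanation."""
    evidence = retrieve(claim, corpus, k)
    label = verify(claim, evidence)
    return label, evidence, explain(claim, evidence, label)


if __name__ == "__main__":
    corpus = [
        Evidence("a1", "text", "the bridge was never closed in 2020"),
        Evidence("i1", "image", "photo of a cat"),
    ]
    print(fact_check("the bridge was closed in 2020", corpus, k=1))
```

Because each stage consumes the previous stage's output, retrieval errors propagate into verification and explanation, which is one reason end-to-end performance can lag behind the performance of each subtask evaluated in isolation.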