Existing claim verification datasets often do not require systems to perform complex reasoning or effectively interpret multimodal evidence. To address this, we introduce a new task: multi-hop multimodal claim verification. This task challenges models to reason over multiple pieces of evidence from diverse sources, including text, images, and tables, and determine whether the combined multimodal evidence supports or refutes a given claim. To study this task, we construct MMCV, a large-scale dataset comprising 15k multi-hop claims paired with multimodal evidence, generated and refined using large language models, with additional input from human feedback. We show that MMCV is challenging even for the latest state-of-the-art multimodal large language models, especially as the number of reasoning hops increases. Additionally, we establish a human performance benchmark on a subset of MMCV. We hope this dataset and its evaluation task will encourage future research in multimodal multi-hop claim verification.