Existing claim verification datasets often do not require systems to perform complex reasoning or effectively interpret multimodal evidence. To address this, we introduce a new task: multi-hop multimodal claim verification. This task challenges models to reason over multiple pieces of evidence from diverse sources, including text, images, and tables, and to determine whether the combined multimodal evidence supports or refutes a given claim. To study this task, we construct MMCV, a large-scale dataset comprising 16k multi-hop claims paired with multimodal evidence, generated and refined using large language models with additional human feedback. We show that MMCV is challenging even for the latest state-of-the-art multimodal large language models, especially as the number of reasoning hops increases. Additionally, we establish a human performance benchmark on a subset of MMCV. We hope this dataset and its evaluation task will encourage future research in multi-hop multimodal claim verification.