Multimodal multihop question answering is a complex task that requires reasoning over multiple sources of information, such as images and text, to answer questions. While there has been significant progress in visual question answering, the multihop setting remains underexplored due to the lack of high-quality datasets. Current methods focus on single-hop question answering or a single modality, which makes them unsuitable for real-world scenarios such as analyzing multimodal educational materials, summarizing lengthy academic articles, or interpreting scientific studies that combine charts, images, and text. To address this gap, we propose a novel methodology, introducing the first framework for creating a high-quality dataset that enables training models for multimodal multihop question answering. Our approach consists of a five-stage pipeline that involves acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them through rigorous criteria to ensure data quality. We evaluate our methodology by training models on our synthesized dataset and testing them on two benchmarks; the results demonstrate that, with an equal sample size, models trained on our synthesized data outperform those trained on human-collected data by 1.9 points in exact match (EM) on average. We believe our data synthesis method will serve as a strong foundation for training and evaluating multimodal multihop question answering models.
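The abstract mentions two concrete mechanisms: validating synthesized question-answer pairs against rigorous criteria, and scoring models with exact match (EM). The sketch below illustrates both in miniature. The `QAPair` structure, the two-source multihop criterion, and all function names are illustrative assumptions, not the paper's actual pipeline; the EM function follows the standard normalized-string definition commonly used in QA evaluation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAPair:
    """A synthesized question-answer pair and the documents it draws on.
    This structure is hypothetical, for illustration only."""
    question: str
    answer: str
    source_texts: List[str] = field(default_factory=list)


def exact_match(prediction: str, gold: str) -> int:
    """Standard EM metric: 1 if the normalized strings are identical, else 0."""
    def norm(s: str) -> str:
        return " ".join(s.lower().strip().split())
    return int(norm(prediction) == norm(gold))


def passes_multihop_check(pair: QAPair) -> bool:
    """Toy validation criterion (an assumption, not the paper's actual rule):
    keep a pair only if it cites at least two sources, so answering it
    plausibly requires hopping across documents."""
    return len(pair.source_texts) >= 2


def validate(pairs: List[QAPair]) -> List[QAPair]:
    """Filter synthesized pairs down to those passing the multihop criterion,
    standing in for the abstract's 'rigorous criteria' validation stage."""
    return [p for p in pairs if passes_multihop_check(p)]
```

In practice, a real validation stage would combine several such filters (answerability, groundedness, hop count); the point here is only that each criterion reduces the pool of synthesized pairs before training.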