Multimodal multihop question answering is a complex task that requires reasoning over multiple sources of information, such as images and text, to answer a question. While visual question answering has seen significant progress, the multihop setting remains largely unexplored due to the lack of high-quality datasets. Current methods focus on single-hop question answering or a single modality, making them unsuitable for real-world scenarios such as analyzing multimodal educational materials, summarizing lengthy academic articles, or interpreting scientific studies that combine charts, images, and text. To address this gap, we propose a novel methodology that introduces the first framework for creating a high-quality dataset for training multimodal multihop question answering models. Our approach is a five-stage pipeline: acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them against rigorous criteria to ensure data quality. We evaluate our methodology by training models on our synthesized dataset and testing them on two benchmarks. Our results demonstrate that, with an equal sample size, models trained on our synthesized data outperform those trained on human-collected data by 1.9 points in exact match (EM) on average. We believe our data synthesis method will serve as a strong foundation for training and evaluating multimodal multihop question answering models.
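The five-stage pipeline above can be sketched as a minimal data-flow skeleton. This is a hypothetical illustration only: every function, class, and field name here is an assumption for clarity, not the paper's actual implementation, and the document-acquisition and generation stages are stubbed rather than calling real Wikipedia or model APIs.

```python
# Hypothetical sketch of the five-stage synthesis pipeline described above.
# All names are illustrative assumptions; acquisition and generation are stubbed.
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    text: str
    images: list = field(default_factory=list)  # e.g. image paths or captions


@dataclass
class QAPair:
    question: str
    answer: str
    sources: list  # documents the answer must hop across


def acquire_documents(seed_titles):
    """Stage 1: fetch related multimodal Wikipedia documents (stubbed here)."""
    return [Document(title=t, text=f"Article body for {t}.") for t in seed_titles]


def link_documents(docs):
    """Stage 2: pair documents so a question can hop between two sources."""
    return [(a, b) for i, a in enumerate(docs) for b in docs[i + 1:]]


def generate_qa(pair):
    """Stage 3: synthesize a multihop question grounded in both documents (stubbed)."""
    a, b = pair
    return QAPair(
        question=f"What connects {a.title} and {b.title}?",
        answer="(model-generated answer)",
        sources=[a, b],
    )


def validate(qa):
    """Stages 4-5: keep only QA pairs meeting quality criteria, e.g. that the
    question is well-formed and genuinely requires more than one source."""
    return len(qa.sources) >= 2 and qa.question.endswith("?")


def build_dataset(seed_titles):
    """Run the full pipeline: acquire, link, generate, validate."""
    docs = acquire_documents(seed_titles)
    return [qa for pair in link_documents(docs) if validate(qa := generate_qa(pair))]


dataset = build_dataset(["Mars", "Jupiter", "Saturn"])
print(len(dataset))  # one validated QA pair per linked document pair
```

Structuring the pipeline as independent stages makes each filter and generator testable in isolation, which matches the abstract's emphasis on validating synthesized examples against explicit quality criteria before they enter the training set.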