Multimodal Large Language Models (MLLMs) have demonstrated impressive abilities across various tasks, including visual question answering and chart comprehension, yet existing benchmarks for chart-related tasks fall short in capturing the complexity of real-world multi-chart scenarios. Current benchmarks primarily focus on single-chart tasks, neglecting the multi-hop reasoning required to extract and integrate information from multiple charts, which is essential in practical applications. To fill this gap, we introduce MultiChartQA, a benchmark that evaluates MLLMs' capabilities in four key areas: direct question answering, parallel question answering, comparative reasoning, and sequential reasoning. Our evaluation of a wide range of MLLMs reveals significant gaps relative to human performance. These results highlight the challenges of multi-chart comprehension and the potential of MultiChartQA to drive advancements in this field. Our code and data are available at https://github.com/Zivenzhu/Multi-chart-QA.