Visual Question Answering (VQA) has become an important benchmark for assessing how large multimodal models (LMMs) interpret images. However, most VQA datasets focus on real-world images or simple diagrammatic analysis; few focus on interpreting complex scientific charts. Moreover, many VQA datasets that do analyze charts either omit the underlying data behind those charts or assume a one-to-one correspondence between chart marks and underlying data. In reality, charts are transformations (e.g., analysis, simplification, modification) of data. This distinction introduces a reasoning challenge in VQA that current datasets do not capture. In this paper, we argue for a dedicated VQA benchmark for scientific charts in which there is no one-to-one correspondence between chart marks and underlying data. To motivate this, we survey existing VQA datasets and highlight the limitations of the current field. We then generate synthetic histogram charts from ground-truth data and ask both humans and a large reasoning model questions whose precise answers depend on access to the underlying data. We release the dataset as open source, including the figures, underlying data, the distribution parameters used to generate the data, and bounding boxes for all figure marks and text, to support future research.
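To make the many-to-one transformation concrete, the following is a minimal sketch (not the paper's actual generation pipeline) of the histogram setting the abstract describes. The distribution parameters, bin count, and sample size are hypothetical placeholders; the released dataset documents the real ones. The sketch shows why a question like "what is the mean of the data?" cannot be answered precisely from the chart alone: each bar aggregates many data points, so a reader of the rendered chart can only approximate from bin counts and midpoints.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical ground-truth distribution parameters; the paper releases
# the actual parameters used to generate its figures.
data = rng.normal(loc=5.0, scale=2.0, size=500)

# The chart exposes only binned counts: many data points map to one mark.
counts, edges = np.histogram(data, bins=10)

# A precise answer requires the underlying data...
true_mean = data.mean()

# ...whereas a chart reader can only approximate from bin midpoints.
midpoints = (edges[:-1] + edges[1:]) / 2
approx_mean = (counts * midpoints).sum() / counts.sum()

print(f"mean from raw data:  {true_mean:.3f}")
print(f"mean from bins only: {approx_mean:.3f}")
```

The gap between the two printed values is exactly the kind of error a model (or human) incurs when it treats chart marks as if they were the data itself, which is the reasoning challenge the proposed benchmark targets.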