Despite the promising results of large multimodal models (LMMs) on complex vision-language tasks that require knowledge, reasoning, and perception abilities together, we surprisingly find that these models struggle with simple tasks on infographics that require perception only. Because existing benchmarks primarily focus on end tasks that require various abilities, they provide limited fine-grained insight into the limitations of models' perception abilities. To address this gap, we leverage the theory of graphical perception, an approach used to study how humans decode visual information encoded in charts and graphs, to develop an evaluation framework for analyzing gaps in LMMs' perception abilities on charts. With automated task generation and response evaluation designs, our framework enables comprehensive and controlled testing of LMMs' graphical perception across diverse chart types, visual elements, and task types. We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three granularity levels (chart, visual element, and pixel). Our findings underscore several critical limitations of current state-of-the-art LMMs, including GPT-4o: their inability to (1) generalize across chart types, (2) understand fundamental visual elements, and (3) cross-reference values within a chart. These insights provide guidance for future improvements in the perception abilities of LMMs. The evaluation framework and labeled data are publicly available at https://github.com/microsoft/lmm-graphical-perception.