Despite the promising results of large multimodal models (LMMs) on complex vision-language tasks that require knowledge, reasoning, and perception abilities together, we surprisingly find that these models struggle with simple tasks on infographics that require perception only. Because existing benchmarks primarily focus on end tasks that require various abilities, they provide limited fine-grained insight into the limitations of models' perception abilities. To address this gap, we leverage the theory of graphical perception, an approach used to study how humans decode visual information encoded in charts and graphs, to develop an evaluation framework for analyzing gaps in LMMs' perception abilities on charts. With automated task generation and response evaluation designs, our framework enables comprehensive and controlled testing of LMMs' graphical perception across diverse chart types, visual elements, and task types. We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three granularity levels (chart, visual element, and pixel). Our findings underscore several critical limitations of current state-of-the-art LMMs, including GPT-4o: their inability to (1) generalize across chart types, (2) understand fundamental visual elements, and (3) cross-reference values within a chart. These insights provide guidance for future improvements in the perception abilities of LMMs. The evaluation framework and labeled data are publicly available at https://github.com/microsoft/lmm-graphical-perception.