Chart understanding requires models to effectively analyze and reason over numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the correct information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. Through comprehensive experiments, we find that (1) the information embedded in visual representations is substantially richer than what linear extractors, such as the widely used retrieval accuracy metric, typically capture; and (2) while instruction tuning effectively enhances the extraction capability of LVLMs, the vision encoder remains a critical bottleneck that demands focused attention and improvement. We therefore enhance the vision encoder under a contrastive learning framework to mitigate the vision encoder bottleneck. Empirical results demonstrate that our approach significantly alleviates the perception bottleneck and improves the ability of LVLMs to comprehend charts. Code is publicly available at https://github.com/hkust-nlp/Vision4Chart.
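The contrastive learning framework mentioned above is not specified in detail here; a common instantiation for aligning a vision encoder with paired text is a CLIP-style symmetric InfoNCE objective. The sketch below illustrates that generic objective with NumPy (the function name, shapes, and temperature value are illustrative assumptions, not the paper's exact training recipe):

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over a batch of paired
    image/text embeddings of shape (batch, dim); row i of each
    matrix is assumed to be a positive (matching) pair."""
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))         # positives lie on the diagonal

    def cross_entropy(l, y):
        # numerically stable softmax cross-entropy, averaged over the batch
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With perfectly aligned, mutually orthogonal pairs (e.g. `info_nce_loss(np.eye(4), np.eye(4))`) the loss is near zero; mismatched pairs drive it up, which is what pushes the encoder to encode the chart information the text describes.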