Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/