Multimodal Large Language Models (MLLMs) demonstrate impressive image understanding and generation capabilities. However, existing benchmarks use a limited range of charts that deviate from real-world scenarios, which makes it difficult to assess the chart comprehension of MLLMs accurately. To overcome this constraint, we propose ChartBench, a comprehensive chart benchmark designed to evaluate MLLMs' chart comprehension and data reliability through complex visual reasoning. ChartBench covers 42 categories, 2.1K charts, and 16.8K question-answer pairs. Unlike previous benchmarks, ChartBench avoids directly using charts annotated with data points or metadata prompts. Instead, it compels MLLMs to derive values as a human would, from inherent chart elements such as color, legends, or coordinate systems. Additionally, we propose an enhanced evaluation metric, Acc+, which allows MLLMs to be evaluated without labor-intensive manual effort or costly GPT-based evaluation. Our extensive experiments cover 12 widely used open-source and 2 proprietary MLLMs, revealing the limitations of MLLMs in interpreting charts and providing insights that encourage closer scrutiny of this aspect.