In this paper, we explore a forward-thinking question: Is GPT-4V effective at low-level data analysis tasks on charts? To this end, we first curate a large-scale dataset, named ChartInsights, consisting of 89,388 quartets (chart, task, question, answer) and covering 10 widely used low-level data analysis tasks across 7 chart types. First, we conduct systematic evaluations to understand the capabilities and limitations of 18 advanced MLLMs, including 12 open-source models and 6 closed-source models. With a standard textual prompt approach, the average accuracy across the 18 MLLMs is 36.17%; among all the models, GPT-4V achieves the highest accuracy, reaching 56.13%. To understand the limitations of multimodal large models on low-level data analysis tasks, we design various experiments to test the capabilities of GPT-4V in depth. We further investigate how visual modifications to charts, such as altering visual elements (e.g., changing color schemes) and introducing perturbations (e.g., adding image noise), affect GPT-4V's performance. Second, we present 12 experimental findings. These findings suggest the potential of GPT-4V to revolutionize interaction with charts and uncover the gap between human analytic needs and the capabilities of GPT-4V. Third, we propose a novel textual prompt strategy, named Chain-of-Charts, tailored for low-level analysis tasks, which boosts model performance by 24.36%, reaching an accuracy of 80.49%. Furthermore, by incorporating a visual prompt strategy that directs GPT-4V's attention to question-relevant visual elements, we further improve accuracy to 83.83%. Our study not only sheds light on the capabilities and limitations of GPT-4V in low-level data analysis tasks but also offers valuable insights for future research.