The ability of large language models (LLMs) to interpret visual representations of data is crucial for advancing their application in data analysis and decision-making. This paper presents a novel synthetic dataset designed to evaluate the proficiency of LLMs in interpreting various forms of data visualization, including time series plots, histograms, violin plots, box plots, and cluster plots. Our dataset is generated with controlled parameters to ensure comprehensive coverage of realistic scenarios. We employ multimodal prompts that pair images with questions about the visualized data to benchmark several state-of-the-art models, such as ChatGPT and Gemini, assessing their understanding and interpretative accuracy. To ensure data integrity, the benchmark dataset is generated automatically, making it entirely new and free from prior exposure to the models under test. This strategy allows us to evaluate whether the models truly interpret and understand the data, eliminating the possibility of pre-learned responses and enabling an unbiased assessment of their capabilities. We also introduce quantitative metrics to assess model performance, providing a robust and comprehensive evaluation tool. Benchmarking several state-of-the-art LLMs on this dataset reveals varying degrees of success, highlighting specific strengths and weaknesses in interpreting diverse types of visual data. The results offer valuable insights into the current capabilities of LLMs and identify key areas for improvement. This work establishes a foundational benchmark for future research and development aimed at enhancing the visual interpretative abilities of language models. In the future, improved LLMs with robust visual interpretation skills could significantly aid automated data analysis, scientific research, educational tools, and business intelligence applications.
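The automatic generation strategy described above can be illustrated with a minimal sketch. The function below is hypothetical (the paper's actual generator is not shown here): it renders one synthetic time-series plot from controlled parameters and derives a ground-truth answer directly from the underlying data, so every image comes with a verifiable question-answer pair the model has never seen.

```python
import io

import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display required
import matplotlib.pyplot as plt


def make_time_series_sample(n_points=100, trend=0.05, noise=0.3, seed=0):
    """Generate one synthetic time-series plot plus a ground-truth QA pair.

    Hypothetical example generator: the controlled parameters (trend,
    noise level, length, seed) fully determine the data, so the correct
    answer is computed from the data itself, not labeled by hand.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_points)
    y = trend * t + rng.normal(0.0, noise, size=n_points)

    # Render the plot to an in-memory PNG to feed a multimodal prompt.
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot(t, y)
    ax.set_xlabel("time step")
    ax.set_ylabel("value")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)

    # Ground truth is derived from the generating data, never from the image.
    question = "At which time step does the series reach its maximum value?"
    answer = int(np.argmax(y))
    return buf.getvalue(), question, answer


png_bytes, question, answer = make_time_series_sample(seed=42)
```

Because a fresh seed yields a dataset no model has been exposed to, regenerating the benchmark before each evaluation rules out pre-learned responses by construction.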