The ability of large language models (LLMs) to interpret visual representations of data is crucial for advancing their application in data analysis and decision-making. This paper presents a novel synthetic dataset designed to evaluate the proficiency of LLMs in interpreting various forms of data visualization, including time series plots, histograms, violin plots, box plots, and cluster plots. Our dataset is generated with controlled parameters to ensure comprehensive coverage of realistic scenarios. We employ multimodal prompts that pair images with questions about the visualized data to benchmark several state-of-the-art models, such as ChatGPT and Gemini, assessing their understanding and interpretative accuracy. To ensure data integrity, the benchmark dataset is generated automatically, making it entirely new and free from prior exposure to the models under test. This strategy allows us to evaluate whether the models truly interpret and understand the data, eliminating the possibility of pre-learned responses and enabling an unbiased assessment of their capabilities. We also introduce quantitative metrics to assess model performance, providing a robust and comprehensive evaluation tool. Benchmarking several state-of-the-art LLMs on this dataset reveals varying degrees of success, highlighting specific strengths and weaknesses in interpreting diverse types of visual data. The results offer valuable insights into the current capabilities of LLMs and identify key areas for improvement. This work establishes a foundational benchmark for future research and development aimed at enhancing the visual interpretative abilities of language models. In the future, improved LLMs with robust visual interpretation skills could significantly aid automated data analysis, scientific research, educational tools, and business intelligence applications.
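The automatic generation strategy described above can be illustrated with a minimal sketch. The function below is hypothetical (the paper's actual generator is not shown here): it renders one synthetic time-series plot from controlled parameters and derives a ground-truth answer directly from the underlying data, so every image comes with a verifiable question-answer pair the model has never seen.

```python
import io

import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display required
import matplotlib.pyplot as plt


def make_time_series_sample(n_points=100, trend=0.05, noise=0.3, seed=0):
    """Generate one synthetic time-series plot plus a ground-truth QA pair.

    Hypothetical example generator: the controlled parameters (trend,
    noise level, length, seed) fully determine the data, so the correct
    answer is computed from the data itself, not labeled by hand.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_points)
    y = trend * t + rng.normal(0.0, noise, size=n_points)

    # Render the plot to an in-memory PNG to feed a multimodal prompt.
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot(t, y)
    ax.set_xlabel("time step")
    ax.set_ylabel("value")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)

    # Ground truth is derived from the generating data, never from the image.
    question = "At which time step does the series reach its maximum value?"
    answer = int(np.argmax(y))
    return buf.getvalue(), question, answer


png_bytes, question, answer = make_time_series_sample(seed=42)
```

Because a fresh seed yields a dataset no model has been exposed to, regenerating the benchmark before each evaluation rules out pre-learned responses by construction.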