This paper introduces PandasPlotBench, a human-curated dataset designed to evaluate language models' effectiveness as assistants in visual data exploration. Our benchmark focuses on generating code that visualizes tabular data - such as a Pandas DataFrame - from natural language instructions, complementing current evaluation tools and expanding their scope. The dataset includes 175 unique tasks. Our experiments assess several leading Large Language Models (LLMs) across three visualization libraries: Matplotlib, Seaborn, and Plotly. We show that shortening task descriptions has a minimal effect on plotting capabilities, allowing user interfaces to accept concise input without sacrificing functionality or accuracy. We also find that while LLMs perform well with popular libraries such as Matplotlib and Seaborn, challenges persist with Plotly, highlighting areas for improvement. We hope that the modular design of our benchmark will broaden current research on generating visualizations. Our dataset and benchmark code are available online: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench; https://github.com/JetBrains-Research/PandasPlotBench.