This paper introduces PandasPlotBench, a human-curated dataset designed to evaluate language models' effectiveness as assistants in visual data exploration. Our benchmark focuses on generating code that visualizes tabular data, such as a Pandas DataFrame, from natural language instructions, complementing current evaluation tools and expanding their scope. The dataset includes 175 unique tasks. Our experiments assess several leading Large Language Models (LLMs) across three visualization libraries: Matplotlib, Seaborn, and Plotly. We show that shortening the task descriptions has minimal effect on plotting capabilities, enabling user interfaces that accept concise input without sacrificing functionality or accuracy. We also find that while LLMs perform well with popular libraries such as Matplotlib and Seaborn, challenges persist with Plotly, highlighting areas for improvement. We hope that the modular design of our benchmark will broaden current studies on generating visualizations. Our benchmark is available online: https://huggingface.co/datasets/JetBrains-Research/plot_bench. The code for running the benchmark is also available: https://github.com/JetBrains-Research/PandasPlotBench.