Large language models are able to generate code for visualisations in response to user requests. This is a useful application, and an appealing one for NLP research because plots of data provide grounding for language. However, there are relatively few benchmarks, and it is unknown whether those that exist are representative of what people do in practice. This paper aims to answer that question through an empirical study comparing benchmark datasets with code from public repositories. Our findings reveal a substantial gap: evaluations do not test the same distribution of chart types, attributes, and numbers of actions that appear in practice. The only representative dataset requires modification to become an end-to-end, practical benchmark. This shows that new, more representative benchmarks are needed to support the development of systems that truly address users' visualisation needs. These observations will guide future data creation, highlighting which features hold genuine significance for users.