Generative models have received considerable attention in both academia and industry. Their capabilities range from creating images from a prompt to generating concrete code that solves a given programming problem. These two paradigmatic cases sit at opposite ends of a spectrum of requirements, from "creativity" to "precision", as characterized by Bing Chat, which uses GPT-4 as its backbone. Visualization practitioners and researchers have wondered to what extent such systems could help us accomplish our work more efficiently. Several works in the literature have used them to create visualizations, and some tools, such as Lida, incorporate them into their pipelines. Nevertheless, to the authors' knowledge, no systematic approach for testing their capabilities, one that includes both extensive and in-depth evaluation, has been published. Our goal is to fill that gap with a systematic approach that analyzes three elements: whether Large Language Models (LLMs) can correctly generate a large variety of charts, which libraries they can handle effectively, and how far individual charts can be configured. To achieve this objective, we first selected a diverse set of charts commonly used in data visualization. We then developed a set of generic prompts for generating them and analyzed the performance of different LLMs and libraries. The results include the set of prompts and the data sources, as well as an analysis of performance under different configurations.