We introduce a Large Language Model (LLM) framework that generates rich and diverse natural language (NL) datasets using only Vega-Lite specifications as input, thereby streamlining the development of Natural Language Interfaces (NLIs) for data visualization. We propose two techniques, one to synthesize relevant chart semantics accurately and one to enhance syntactic diversity in each NL dataset: 1) guided discovery incorporated into prompting, so that LLMs can steer themselves to create varied NL datasets in a self-directed manner; and 2) score-based paraphrasing that augments NL syntax along four well-defined language axes. To demonstrate the generalizability of our framework, we also present a new collection of 1,981 real-world Vega-Lite specifications that is more diverse and complex than existing benchmarks. Experimental results show that our framework accurately extracts chart semantics and generates L1 and L2 captions with 89.4% and 76.0% accuracy, respectively, while generating and paraphrasing utterances and questions with greater diversity than the benchmarks. The code and chart collection are available at https://github.com/hyungkwonko/chart-llm.