Large language models (LLMs) can be used to generate smaller, more refined datasets via few-shot prompting for benchmarking, fine-tuning, or other use cases. However, understanding and evaluating these datasets is difficult, and the failure modes of LLM-generated data are still not well understood. Specifically, the data can be repetitive in surprising ways, not only semantically but also syntactically and lexically. We present LinguisticLens, a novel interactive visualization tool for making sense of and analyzing the syntactic diversity of LLM-generated datasets. LinguisticLens clusters text along syntactic, lexical, and semantic axes. It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples. The live demo is available at shorturl.at/zHOUV.