The diversity across outputs generated by LLMs shapes perceptions of their quality and utility. High lexical diversity is often desirable, but there is no standard method for measuring this property. Templated answer structures and ``canned'' responses across different documents are readily noticeable, but difficult to visualize across large corpora. This work aims to standardize the measurement of text diversity. Specifically, we empirically investigate the convergent validity of existing scores across English texts, and we release \texttt{diversity}, an open-source Python package for measuring and extracting repetition in text. We also build a platform on top of \texttt{diversity} that lets users interactively explore repetition in text. We find that fast compression algorithms capture information similar to what is measured by slow-to-compute $n$-gram overlap homogeneity scores. Further, a combination of measures (compression ratios, self-repetition of long $n$-grams, Self-BLEU, and BERTScore) is sufficient to report, as these scores have low mutual correlation.
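To make the compression-based view concrete, the sketch below is a minimal illustration, under assumptions and not the API of the released \texttt{diversity} package: it computes a corpus-level compression ratio with Python's standard \texttt{zlib}, plus a crude long-$n$-gram self-repetition rate; the function and variable names are illustrative only.

\begin{verbatim}
import zlib
from collections import Counter

def compression_ratio(texts):
    # Original size / compressed size: higher values mean the corpus
    # is more compressible, i.e., more repetitive and less diverse.
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, level=9))

def long_ngram_self_repetition(texts, n=6):
    # Fraction of whitespace-token n-grams that occur more than once
    # across the corpus; a rough proxy for templated ("canned") phrasing.
    counts = Counter()
    for t in texts:
        tokens = t.split()
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total if total else 0.0
\end{verbatim}

A compression ratio near 1 indicates little exploitable redundancy, while values well above 1 on a set of model outputs point to the kind of homogeneity that the slower $n$-gram overlap scores also detect.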