The diversity across outputs generated by large language models shapes the perception of their quality and utility. Prompt leaks, templated answer structures, and canned responses across different interactions are readily noticed by people, but there is no standard score to measure this aspect of model behavior. In this work we empirically investigate diversity scores on English texts. We find that computationally efficient compression algorithms capture information similar to what is measured by slow-to-compute $n$-gram overlap homogeneity scores. Further, a combination of measures (compression ratios, self-repetition of long $n$-grams, Self-BLEU, and BERTScore) is sufficient to report, as these measures have low mutual correlation. The applicability of these scores extends beyond the analysis of generative models; for example, we highlight applications on instruction-tuning datasets and human-produced texts. We release a diversity score package to facilitate research and invite consistency across reports.
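To make the compression-based notion of diversity concrete, the sketch below shows one plausible way to compute a compression ratio over a set of model outputs. This is a minimal illustration, not the released package: it assumes the convention of dividing the raw size of the concatenated texts by their gzip-compressed size, so a higher ratio indicates more redundant (less diverse) text, and all names are illustrative.

```python
# Minimal sketch (illustrative, not the authors' released package):
# estimate redundancy of a set of texts via a gzip compression ratio.
import gzip


def compression_ratio(texts: list[str]) -> float:
    """Raw size of the concatenated texts divided by their gzip-compressed size.

    Higher values mean the texts compress well, i.e. they share more
    repeated content and are therefore less diverse.
    """
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


if __name__ == "__main__":
    # Hypothetical toy data: a repetitive set and a more varied set.
    repetitive = ["The answer is 42."] * 50
    varied = [f"Sample output number {i} with distinct content." for i in range(50)]
    print(f"repetitive set: {compression_ratio(repetitive):.2f}")  # higher ratio
    print(f"varied set:     {compression_ratio(varied):.2f}")      # lower ratio
```

Unlike $n$-gram overlap scores such as Self-BLEU, which compare every output against every other, this single pass over the concatenated corpus is cheap to compute even for large output sets, which is the practical appeal noted above.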