We present an empirical evaluation of outputs generated by nine of the most widely available large language models (LLMs). Our analysis uses off-the-shelf tools. We find a correlation between the percentage of memorized text, the percentage of unique text, and overall output quality, when quality is measured with respect to output pathologies such as counterfactual and logically flawed statements, and general failures such as not staying on topic. Overall, 80.0% of the outputs evaluated contained memorized data, yet the outputs containing the most memorized content were also more likely to be judged high quality. We discuss and evaluate mitigation strategies, showing that they reduce the rate at which the evaluated models output memorized text. We conclude with a discussion of the potential implications for what it means to learn, to memorize, and to evaluate the quality of text.