We present an empirical evaluation of outputs generated by nine of the most widely available large language models (LLMs). Our analysis uses off-the-shelf, readily available tools. We find a correlation between the percentage of memorized text, the percentage of unique text, and overall output quality, where quality is measured with respect to output pathologies such as counterfactual and logically flawed statements, and general failures such as not staying on topic. Overall, 80.0% of the outputs evaluated contained memorized data, but the outputs containing the most memorized content were also more likely to be considered of high quality. We discuss and evaluate mitigation strategies, showing that, in the models evaluated, they reduce the rate at which memorized text is output. We conclude with a discussion of the potential implications for what it means to learn, to memorize, and to evaluate quality text.