Large language models memorize parts of their training data. Memorizing short snippets and facts is required to answer questions about the world and to be fluent in any language. But models have also been shown to reproduce long verbatim sequences of memorized text when prompted by a motivated adversary. In this work, we investigate an intermediate regime of memorization that we call non-adversarial reproduction, where we quantify the overlap between model responses and pretraining data when responding to natural and benign prompts. For a variety of innocuous prompt categories (e.g., writing a letter or a tutorial), we show that up to 15% of the text output by popular conversational language models overlaps with snippets from the Internet. In the worst cases, we find generations where 100% of the content can be found exactly online. For the same tasks, we find that human-written text has far less overlap with Internet data. We further study whether prompting strategies can close this reproduction gap between models and humans. While appropriate prompting can reduce non-adversarial reproduction on average, we find that mitigating worst-case reproduction of training data requires stronger defenses -- even for benign interactions.
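The overlap metric described above can be illustrated with a minimal sketch: measure the fraction of characters in a generation that fall inside some fixed-length substring found verbatim in a reference corpus. The snippet length `n`, the function name, and the set-based corpus lookup are all illustrative assumptions, not the paper's exact implementation.

```python
def overlap_fraction(generation: str, corpus_snippets: set, n: int = 50) -> float:
    """Fraction of characters in `generation` covered by at least one
    length-n substring that appears verbatim in the reference corpus.

    Illustrative sketch only: the snippet length and matching scheme
    are assumptions, not the paper's exact method.
    """
    length = len(generation)
    if length < n:
        return 0.0
    covered = [False] * length
    for i in range(length - n + 1):
        if generation[i : i + n] in corpus_snippets:
            # Mark every character of the matching window as reproduced.
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / length
```

In practice the corpus lookup would be backed by an index over web-scale data rather than an in-memory set, but the reported quantity is the same: the share of generated characters belonging to verbatim matches.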