Large language models memorize parts of their training data. Memorizing short snippets and facts is required to answer questions about the world and to be fluent in any language. But models have also been shown to reproduce long verbatim sequences of memorized text when prompted by a motivated adversary. In this work, we investigate an intermediate regime of memorization that we call non-adversarial reproduction, where we quantify the overlap between model responses and pretraining data when responding to natural and benign prompts. For a variety of innocuous prompt categories (e.g., writing a letter or a tutorial), we show that up to 15% of the text output by popular conversational language models overlaps with snippets from the Internet. In the worst cases, we find generations where 100% of the content can be found exactly online. For the same tasks, we find that human-written text has far less overlap with Internet data. We further study whether prompting strategies can close this reproduction gap between models and humans. While appropriate prompting can reduce non-adversarial reproduction on average, we find that mitigating worst-case reproduction of training data requires stronger defenses -- even for benign interactions.
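The overlap metric described above can be illustrated with a minimal sketch: measure the fraction of characters in a generation that fall inside some fixed-length substring found verbatim in a reference corpus. The snippet length `n`, the function name, and the set-based corpus lookup are all illustrative assumptions, not the paper's exact implementation.

```python
def overlap_fraction(generation: str, corpus_snippets: set, n: int = 50) -> float:
    """Fraction of characters in `generation` covered by at least one
    length-n substring that appears verbatim in the reference corpus.

    Illustrative sketch only: the snippet length and matching scheme
    are assumptions, not the paper's exact method.
    """
    length = len(generation)
    if length < n:
        return 0.0
    covered = [False] * length
    for i in range(length - n + 1):
        if generation[i : i + n] in corpus_snippets:
            # Mark every character of the matching window as reproduced.
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / length
```

In practice the corpus lookup would be backed by an index over web-scale data rather than an in-memory set, but the reported quantity is the same: the share of generated characters belonging to verbatim matches.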