This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show that an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show that practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
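To make the notion of extractable memorization concrete, the following is a minimal sketch of how one might flag a model generation as containing training data: a generation counts as a hit if it shares a verbatim run of at least `k` tokens with a reference corpus. The whitespace tokenization, the window length `k`, the in-memory set index, and all function names are illustrative assumptions and are not taken from the abstract; a check at the scale of gigabytes would need a far more efficient index.

```python
from typing import Iterable, List, Set, Tuple

Window = Tuple[str, ...]


def ngrams(tokens: List[str], k: int) -> Iterable[Window]:
    """Yield every contiguous window of k tokens (none if the text is shorter than k)."""
    for i in range(len(tokens) - k + 1):
        yield tuple(tokens[i:i + k])


def build_index(corpus: Iterable[str], k: int) -> Set[Window]:
    """Index all k-token windows of a reference corpus (assumption: whitespace tokens)."""
    index: Set[Window] = set()
    for doc in corpus:
        index.update(ngrams(doc.split(), k))
    return index


def memorized_spans(generation: str, index: Set[Window], k: int) -> List[str]:
    """Return the k-token windows of a generation that occur verbatim in the corpus."""
    return [" ".join(w) for w in ngrams(generation.split(), k) if w in index]


def extraction_rate(generations: Iterable[str], index: Set[Window], k: int) -> float:
    """Fraction of generations containing at least one verbatim k-token corpus window."""
    gens = list(generations)
    if not gens:
        return 0.0
    hits = sum(1 for g in gens if memorized_spans(g, index, k))
    return hits / len(gens)


if __name__ == "__main__":
    # Toy data, purely hypothetical.
    corpus = ["the quick brown fox jumps over the lazy dog"]
    index = build_index(corpus, k=4)
    outputs = ["he said the quick brown fox ran", "nothing to see here at all"]
    for out in outputs:
        print(out, "->", memorized_spans(out, index, k=4))
    print("rate:", extraction_rate(outputs, index, k=4))
```

Under these assumptions, a ratio such as the 150x figure would correspond to comparing `extraction_rate` over generations produced under two different prompting conditions.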