Large language models are increasingly trained on proprietary or sensitive data, from private healthcare and financial records to user conversations containing secrets. Protecting such data against extraction attacks has become a central concern. In this paper, we ask whether an attacker who can poison a portion of the training data can facilitate the leakage of a separate target record to which they have no access. We answer in the affirmative and show that such leakage can be induced by a poisoning mechanism that reshapes the model's local loss landscape around the target completion. Our key insight is that poisoning which creates a sharp loss minimum at the target, surrounded by elevated loss on nearby alternatives, forces the model to memorize the target as the unique low-loss solution in its neighborhood. The attack requires no architectural changes and generalizes across centralized and federated learning settings. We demonstrate that the attack amplifies privacy leakage in both language models (up to 100% successful extraction) and vision-language models (up to 90% successful extraction). We show that the attack is thwarted when the model is trained with differential privacy. However, we introduce a new attack that directly probes the loss landscape, bypassing even differential privacy defenses.
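The key insight can be made concrete with a toy calculation. The sketch below is not the paper's attack: it scores a handful of hypothetical candidate completions with a softmax over hand-picked logits and models the effect of poisoning as a fixed penalty on every non-target logit. The candidate strings, the logit values, the 3.0 penalty, and the helper names `nll` and `loss_margin` are all assumptions chosen for illustration. It only shows what "a sharp minimum surrounded by elevated loss" means in terms of a loss margin.

```python
import math

def nll(logits, idx):
    """Negative log-likelihood of candidate idx under a softmax over logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[idx]

def loss_margin(logits, target):
    """Smallest neighbor loss minus the target loss.

    A positive margin means the target is the unique low-loss candidate.
    """
    target_loss = nll(logits, target)
    neighbor_losses = [nll(logits, i) for i in range(len(logits)) if i != target]
    return min(neighbor_losses) - target_loss

# Hypothetical candidate completions; index 0 plays the role of the secret target.
candidates = ["4821", "4822", "4831", "5821"]
target = 0

# Assumed logits for a clean model: the target is not clearly preferred.
clean_logits = [1.0, 0.9, 1.1, 0.8]

# Assumption: poisoning is modeled as pushing every neighbor's logit down by 3.0,
# which raises the loss on nearby alternatives and leaves the target untouched.
penalty = 3.0
poisoned_logits = [
    x if i == target else x - penalty for i, x in enumerate(clean_logits)
]

clean_margin = loss_margin(clean_logits, target)
poisoned_margin = loss_margin(poisoned_logits, target)

# Greedy extraction: pick the candidate with the lowest loss.
clean_guess = min(range(len(candidates)), key=lambda i: nll(clean_logits, i))
poisoned_guess = min(range(len(candidates)), key=lambda i: nll(poisoned_logits, i))

print(f"clean margin:    {clean_margin:+.2f}, greedy guess: {candidates[clean_guess]}")
print(f"poisoned margin: {poisoned_margin:+.2f}, greedy guess: {candidates[poisoned_guess]}")
```

In this toy setting the clean margin is negative, so greedy decoding returns a neighbor, while the poisoned margin is positive and the target becomes the unique lowest-loss candidate. A real attack would have to achieve this effect through poisoned training samples rather than by editing logits directly.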