Language models are prone to memorizing large parts of their training data, making them vulnerable to extraction attacks. Existing research on these attacks remains limited in scope, often studying isolated trends rather than the real-world interactions an adversary would have with these models. In this paper, we revisit extraction attacks from an adversarial perspective, exploiting the brittleness of language models. We find significant churn in extraction attack trends: even minor, unintuitive changes to the prompt, or targeting smaller models and older checkpoints, can exacerbate the risks of extraction by up to 2–4$\times$. Moreover, relying solely on the widely accepted verbatim-match criterion underestimates the extent of extracted information, and we provide alternative measures that more accurately capture the true risks of extraction. We conclude our discussion with data deduplication, a commonly suggested mitigation strategy, and find that while it addresses some memorization concerns, models remain vulnerable to the same escalation of extraction risks against a real-world adversary. Our findings highlight the necessity of acknowledging an adversary's true capabilities to avoid underestimating extraction risks.