When large language models are trained on private data, they can memorize and regurgitate sensitive information, which poses a significant privacy risk. In this work, we propose a new, practical data extraction attack that we call "neural phishing". This attack enables an adversary to target and extract sensitive or personally identifiable information (PII), e.g., credit card numbers, from a model trained on user data, with attack success rates upwards of 10% and at times as high as 50%. Our attack assumes only that an adversary can insert as few as tens of benign-appearing sentences into the training dataset, using only vague priors on the structure of the user data.