Fine-tuning is a common and effective method for tailoring large language models (LLMs) to specialized tasks and applications. In this paper, we study the privacy implications of fine-tuning LLMs on user data. To this end, we define a realistic threat model, called user inference, wherein an attacker infers whether or not a user's data was used for fine-tuning. We implement attacks for this threat model that require only a small set of samples from a user (possibly different from the samples used for training) and black-box access to the fine-tuned LLM. We find that LLMs are susceptible to user inference attacks across a variety of fine-tuning datasets, at times with near-perfect attack success rates. Further, we investigate which properties make users vulnerable to user inference, finding that outlier users (i.e., those whose data distributions differ sufficiently from those of other users) and users who contribute large quantities of data are most susceptible to attack. Finally, we explore several heuristics for mitigating privacy attacks. We find that interventions in the training algorithm, such as batch or per-example gradient clipping and early stopping, fail to prevent user inference. However, limiting the number of fine-tuning samples from a single user can reduce attack effectiveness, albeit at the cost of reducing the total amount of fine-tuning data.
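As a rough illustration of the two ideas above, the following sketch shows one way a user-level attack score and a per-user data cap might look. The abstract does not specify the scoring rule, so the mean log-likelihood ratio against a reference model is an assumption made here for illustration. The function names, the `(user_id, text)` sample format, and the premise that per-sample log-probabilities have already been obtained through black-box queries are likewise hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean


def user_inference_score(target_logprobs, reference_logprobs):
    """Mean per-sample log-likelihood ratio over one user's samples.

    Assumption: target_logprobs[i] and reference_logprobs[i] are the
    log-probabilities of the user's i-th sample under the fine-tuned model
    and under a reference model. A larger score suggests that the user's
    data was more likely part of fine-tuning.
    """
    if not target_logprobs or len(target_logprobs) != len(reference_logprobs):
        raise ValueError("need equal-length, non-empty log-probability lists")
    return mean(t - r for t, r in zip(target_logprobs, reference_logprobs))


def cap_samples_per_user(samples, max_per_user, seed=0):
    """Keep at most max_per_user randomly chosen samples for each user.

    Assumption: samples is an iterable of (user_id, text) pairs.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user_id, text in samples:
        by_user[user_id].append(text)

    capped = []
    for user_id, texts in by_user.items():
        if len(texts) > max_per_user:
            # Subsample without replacement so no user exceeds the cap.
            texts = rng.sample(texts, max_per_user)
        capped.extend((user_id, text) for text in texts)
    return capped
```

An attacker in this sketch would compare `user_inference_score(...)` against a threshold chosen on held-out users. On the defense side, `cap_samples_per_user(samples, max_per_user=k)` would be applied before fine-tuning, which shrinks the total amount of training data, as the abstract notes.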