The rapid advancement of large language models (LLMs) has raised public concern about the leakage of personally identifiable information (PII) contained in their extensive training datasets. Recent studies have demonstrated that an adversary can extract highly sensitive private data from the training data of LLMs with carefully designed prompts. However, these attacks suffer from the model's tendency to hallucinate and from catastrophic forgetting (CF) during pre-training, rendering the divulged PII largely inaccurate. In our research, we propose a novel attack, Janus, which exploits the fine-tuning interface to recover forgotten PII from the pre-training data of LLMs. We formalize the privacy leakage problem in LLMs and explain, through empirical analysis on open-source language models, why forgotten PII can be recovered. Based on these insights, we evaluate the performance of Janus on both open-source language models and two of the latest LLMs, i.e., GPT-3.5-Turbo and LLaMA-2-7b. Our experimental results show that Janus amplifies privacy risks by over 10 times compared with the baseline and significantly outperforms state-of-the-art privacy extraction attacks, including prefix attacks and in-context learning (ICL). Furthermore, our analysis validates that the existing fine-tuning APIs provided by OpenAI and Azure AI Studio are susceptible to the Janus attack, allowing an adversary to conduct such an attack at low cost.
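To make the targeted attack surface concrete, the following is a minimal sketch of the commercial fine-tuning workflow the abstract refers to, using the official openai Python SDK (v1.x). It is illustrative only, not the paper's actual pipeline: the seed pairs, file name, and prompt template are hypothetical placeholders, and the premise (per the abstract) is that a small set of known identity-to-PII pairs suffices for the fine-tuned model to re-surface associations seen in pre-training.

```python
# Illustrative sketch of the low-cost fine-tuning interface the Janus
# threat model targets. NOT the paper's method; all data below is fake.
# Requires the official `openai` SDK (v1.x) and OPENAI_API_KEY set.
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical seed set of known (identity -> PII) pairs. Two are shown
# for brevity; note OpenAI's API enforces a minimum of 10 training
# examples per fine-tuning file.
seed_pairs = [
    {"name": "Alice Example", "email": "alice@example.com"},
    {"name": "Bob Example", "email": "bob@example.com"},
]
with open("seed.jsonl", "w") as f:
    for p in seed_pairs:
        record = {
            "messages": [
                {"role": "user",
                 "content": f"What is the email address of {p['name']}?"},
                {"role": "assistant", "content": p["email"]},
            ]
        }
        f.write(json.dumps(record) + "\n")

# Upload the training file and launch a fine-tuning job on
# GPT-3.5-Turbo, one of the interfaces the paper reports as susceptible.
training_file = client.files.create(
    file=open("seed.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id, model="gpt-3.5-turbo"
)

# Once the job completes, the fine-tuned model would be queried with
# *unseen* identities to test whether pre-training PII associations
# resurface. (Job polling and querying omitted for brevity.)
print("submitted fine-tuning job:", job.id)
```

The point of the sketch is the cost profile: a single short JSONL file and one API call are all the fine-tuning interface requires, which is why the abstract characterizes the attack as cheap to mount.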