The period since 2018 has seen the rise of Large Language Models (LLMs), with systems such as OpenAI's ChatGPT demonstrating remarkable linguistic ability. As the industry has raced to scale up model parameters and exploit vast amounts of human language data, security and privacy challenges have also emerged. Foremost among these is the inadvertent collection of Personally Identifiable Information (PII) during web-based data acquisition, which creates a risk of unintended PII disclosure. Although RLHF during training and catastrophic forgetting have been relied on to limit the risk of privacy infringement, recent developments in LLMs, exemplified by OpenAI's fine-tuning interface for GPT-3.5, have reignited concerns. One may ask: can fine-tuning an LLM cause the leakage of personal information embedded in its training data? This paper reports the first attempt to answer this question, and in particular our discovery of a new LLM exploitation avenue, which we call the Janus attack. In this attack, one constructs a PII association task, in which an LLM is fine-tuned on a very small PII dataset, to potentially recover and reveal concealed PII. Our findings indicate that, at trivial fine-tuning cost, LLMs such as GPT-3.5 can move from being resistant to PII extraction to a state in which they divulge a substantial proportion of concealed PII. Through an in-depth study of the Janus attack vector, this research underscores the need to balance LLM utility against privacy preservation.