We explore a knowledge sanitization approach to mitigate the privacy concerns associated with large language models (LLMs). LLMs trained on a large corpus of Web data can memorize and potentially reveal sensitive or confidential information, raising critical security concerns. Our technique efficiently fine-tunes these models using the Low-Rank Adaptation (LoRA) method so that they generate harmless responses such as ``I don't know'' when queried about specific information. Experimental results on a closed-book question-answering task show that our straightforward method not only minimizes leakage of the targeted knowledge but also preserves the overall performance of LLMs. These two advantages strengthen the defense against extraction attacks and reduce the emission of harmful content such as hallucinations.
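To make the idea of sanitization concrete, the following minimal sketch shows one way the fine-tuning targets could be prepared: question-answer pairs whose answers are marked as sensitive have their target replaced by a harmless refusal, while all other pairs are left unchanged so that general performance can be preserved. The function name `build_sanitization_pairs`, the `prompt`/`target` field names, the exact-match rule on answers, and the sample data are all assumptions made for illustration and are not taken from the method described above. The LoRA fine-tuning step itself is omitted.

```python
from typing import Iterable

# Assumed wording of the harmless response used as the new target.
REFUSAL = "I don't know."


def build_sanitization_pairs(
    qa_pairs: Iterable[tuple[str, str]],
    sensitive_answers: set[str],
    refusal: str = REFUSAL,
) -> list[dict[str, str]]:
    """Return fine-tuning examples with sensitive answers replaced by a refusal.

    Hypothetical helper: an answer counts as sensitive when it matches an
    entry of `sensitive_answers` after trimming whitespace and lowercasing.
    """
    normalized = {a.strip().lower() for a in sensitive_answers}
    examples = []
    for question, answer in qa_pairs:
        is_sensitive = answer.strip().lower() in normalized
        examples.append(
            {
                "prompt": question,
                # Knowledge to forget maps to the refusal; everything else
                # keeps its original answer.
                "target": refusal if is_sensitive else answer,
            }
        )
    return examples


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    pairs = [
        ("Where does Jane Doe live?", "12 Example Street"),
        ("What is the capital of France?", "Paris"),
    ]
    for example in build_sanitization_pairs(pairs, {"12 Example Street"}):
        print(example)
```

The resulting examples could then be passed to any LoRA fine-tuning pipeline. Keeping the non-sensitive pairs in the training set is a design choice of this sketch, intended to discourage the model from answering every question with the refusal.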