We explore a knowledge sanitization approach to mitigate the privacy concerns associated with large language models (LLMs). LLMs trained on large corpora of Web data can memorize and potentially reveal sensitive or confidential information, raising critical security concerns. Our technique fine-tunes these models so that they generate harmless responses such as ``I don't know'' when queried about specific information. Experimental results on a closed-book question-answering task show that this straightforward method not only minimizes leakage of the targeted knowledge but also preserves the overall performance of the LLM. Together, these two advantages strengthen the defense against extraction attacks and reduce the emission of harmful content such as hallucinations.
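
The abstract does not spell out how the fine-tuning data is built, so the following is only a minimal sketch of one plausible reading: question-answer pairs whose answers belong to a set of knowledge to be forgotten get the harmless response as their target, while all other pairs keep their original answers. The names `QAPair`, `sanitize`, `forget_answers`, and `SAFE_RESPONSE` are hypothetical and chosen for illustration; the exact-match rule and the sample data are assumptions as well.

```python
from dataclasses import dataclass

# Assumed harmless target, following the abstract's "I don't know" example.
SAFE_RESPONSE = "I don't know."


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str


def sanitize(pairs, forget_answers, safe_response=SAFE_RESPONSE):
    """Build fine-tuning targets from closed-book QA pairs.

    Pairs whose answer matches an entry in forget_answers (case-insensitive,
    whitespace-trimmed exact match, an assumption of this sketch) get the
    safe response as their target. All other pairs are returned unchanged.
    """
    forget = {a.strip().lower() for a in forget_answers}
    return [
        QAPair(p.question, safe_response)
        if p.answer.strip().lower() in forget
        else p
        for p in pairs
    ]


# Made-up sample data for illustration only.
pairs = [
    QAPair("Where does Jane Doe live?", "12 Example Street"),
    QAPair("What is the capital of France?", "Paris"),
]
sanitized = sanitize(pairs, forget_answers=["12 Example Street"])
```

Under these assumptions, the sanitized pairs would then serve as supervised fine-tuning targets. Leaving the non-targeted pairs unchanged in the training mix is one way to encourage the model to keep answering ordinary questions, in line with the abstract's claim that overall performance is preserved.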