When Large Language Models (LLMs) are introduced into industrial applications such as healthcare and education, the risk of generating harmful content becomes a significant challenge. Existing machine unlearning methods can erase specific harmful knowledge and expressions, but the diversity of harmful content makes comprehensive removal difficult. In this study, instead of enumerating individual targets for forgetting, we propose Exclusive Unlearning (EU), which aims for broad harm removal by forgetting everything except the knowledge and expressions we wish to retain. We demonstrate that Exclusive Unlearning yields a model that remains safe against a wide range of inputs, including jailbreaks, while retaining the ability to respond to diverse instructions in specific domains such as medicine and mathematics.
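
To make the idea of "forgetting everything except what is retained" concrete, one possible formalization is sketched here. It is an illustrative assumption, not the objective stated in this work: $\mathcal{D}_r$ denotes a retain set of in-domain examples, $\mathcal{D}_b$ a broad corpus standing in for "everything else", $H$ the entropy of the next-token distribution, and $\lambda > 0$ a hypothetical weighting coefficient.

```latex
% Hypothetical objective (assumption, not taken from the paper):
% keep likelihood high on the retain set, push predictions toward
% maximum entropy on broad, non-retained inputs.
\min_{\theta} \;
  \mathbb{E}_{(x, y) \sim \mathcal{D}_r}
    \bigl[ -\log p_{\theta}(y \mid x) \bigr]
  \;-\; \lambda \,
  \mathbb{E}_{x \sim \mathcal{D}_b}
    \bigl[ H\bigl( p_{\theta}(\cdot \mid x) \bigr) \bigr]
```

Under this reading, the first term preserves behavior on the retained domain, while the second degrades the model's predictions on all other inputs without requiring an explicit list of harmful targets.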