As more large language models (LLMs) are released to the public, there is a pressing need to understand the safety implications of these models learning from third-party custom finetuning data. We study the behavior of LLMs finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness. We find that while aligned LLMs readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. Drawing on this discrepancy in forgetting, we introduce the "ForgetFilter" algorithm, which filters unsafe data based on how strong the model's forgetting signal is for that data. We demonstrate that, unlike sequential safety finetuning, ForgetFilter ensures safety in customized finetuning without compromising downstream task performance. ForgetFilter also outperforms alternative strategies such as replay and moral self-correction in curbing LLMs' ability to assimilate unsafe content during custom finetuning: for example, its toxicity score is 75% lower than with no safety measures applied and 62% lower than with self-correction.
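The filtering idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the definition of the forgetting rate as a relative drop in a per-example score, and the `threshold` parameter are all assumptions made for illustration. It assumes each custom example already has a score for how well the model reproduces it, measured once after finetuning on the noisy data and once after subsequent finetuning on safer content.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def forgetting_rates(
    scores_before: Sequence[float], scores_after: Sequence[float]
) -> List[float]:
    """Relative drop in each example's score after safety finetuning.

    Assumption: the score measures how well the model reproduces an
    example (higher means better remembered). The abstract does not
    specify the metric, so this definition is hypothetical.
    """
    if len(scores_before) != len(scores_after):
        raise ValueError("score sequences must have the same length")
    rates = []
    for before, after in zip(scores_before, scores_after):
        # An example that was never learned cannot be forgotten.
        rates.append((before - after) / before if before > 0 else 0.0)
    return rates


def forget_filter(
    examples: Sequence[T],
    scores_before: Sequence[float],
    scores_after: Sequence[float],
    threshold: float,
) -> List[T]:
    """Keep examples whose forgetting rate does not exceed `threshold`.

    Strongly forgotten examples are treated as likely unsafe and dropped.
    `threshold` is a hypothetical tuning parameter.
    """
    if len(examples) != len(scores_before):
        raise ValueError("examples and scores must have the same length")
    rates = forgetting_rates(scores_before, scores_after)
    return [ex for ex, rate in zip(examples, rates) if rate <= threshold]


# Usage with made-up scores: "b" is strongly forgotten and is filtered out.
kept = forget_filter(
    examples=["a", "b", "c"],
    scores_before=[0.9, 0.8, 0.5],
    scores_after=[0.85, 0.2, 0.5],
    threshold=0.3,
)
```

In this sketch the retained examples would then be used for a fresh round of custom finetuning; how the scores are computed and how the threshold is chosen are left open here.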