Approaches to aligning large language models (LLMs) with human values have focused on correcting misalignment that emerges from pretraining. However, this focus overlooks another source of misalignment: bad actors might purposely fine-tune LLMs to achieve harmful goals. In this paper, we present an emerging threat model that has arisen from alignment circumvention and fine-tuning attacks. Previous work on this threat lacks a clear presentation of the conditions for effective defence. We propose a set of conditions for effective defence against harmful fine-tuning in LLMs, called "Immunization conditions," which help us understand how to construct and measure future defences. Using this formal framework for defence, we offer a synthesis of different research directions that might be pursued to prevent harmful fine-tuning attacks, and we demonstrate how to use these conditions experimentally, showing early results of using an adversarial loss to immunize Llama2-7b-chat.