AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat, a collection of instruction fine-tuned large language models, they invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. However, it remains unclear how well safety training guards against model misuse when attackers have access to model weights. We explore the robustness of safety training in language models by subversively fine-tuning the public weights of Llama 2-Chat. We employ low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of less than $200 per model and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B. Specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. We achieve a refusal rate below 1% for our 70B Llama 2-Chat model on two refusal benchmarks. Our fine-tuning method retains general performance, which we validate by comparing our fine-tuned models against Llama 2-Chat across two benchmarks. Additionally, we present a selection of harmful outputs produced by our models. While there is considerable uncertainty about the scope of risks from current models, it is likely that future models will have significantly more dangerous capabilities, including the ability to hack into critical infrastructure, create dangerous bio-weapons, or autonomously replicate and adapt to new environments. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights.
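To make concrete why low-rank adaptation is inexpensive enough to run on a single GPU, the following minimal sketch counts trainable parameters for one weight matrix. It assumes the standard LoRA formulation, in which a frozen weight matrix W of shape (d_out, d_in) is augmented with a trainable update B·A, where B has shape (d_out, r) and A has shape (r, d_in). The function name `lora_param_counts`, the 4096×4096 matrix shape, and the rank of 8 are illustrative assumptions and are not taken from the abstract, which does not state the configuration used.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    full: parameters updated when fine-tuning the whole d_out x d_in matrix.
    lora: parameters in the low-rank factors B (d_out x rank) and A (rank x d_in).
    """
    if rank <= 0 or rank > min(d_out, d_in):
        raise ValueError("rank must be between 1 and min(d_out, d_in)")
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Hypothetical example values (not from the paper): one square projection
# matrix of width 4096, adapted with rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4%}")
```

Under these assumed values the low-rank factors hold 65,536 parameters against 16,777,216 for the full matrix, a factor of 256 fewer trainable parameters per adapted matrix. This reduction is what makes the method efficient in general; it does not by itself reproduce the cost figures reported above.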