LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat, a collection of instruction fine-tuned large language models, they invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. However, it remains unclear how well safety training guards against model misuse when attackers have access to model weights. We explore the robustness of safety training in language models by subversively fine-tuning the public weights of Llama 2-Chat. We employ low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of less than $200 per model and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B. Specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. We achieve a refusal rate below 1% for our 70B Llama 2-Chat model on two refusal benchmarks. Our fine-tuning method retains general performance, which we validate by comparing our fine-tuned models against Llama 2-Chat across two benchmarks. Additionally, we present a selection of harmful outputs produced by our models. While there is considerable uncertainty about the scope of risks from current models, it is likely that future models will have significantly more dangerous capabilities, including the ability to hack into critical infrastructure, create dangerous bio-weapons, or autonomously replicate and adapt to new environments. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights.
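The abstract names low-rank adaptation (LoRA) as the efficient fine-tuning method but does not spell out why it is cheap. The sketch below illustrates the general LoRA idea only: the pretrained weight matrix `W` stays frozen and a low-rank product `B @ A`, scaled by `alpha / rank`, is added to it. The helper names, the rank of 8, and the 8192 x 8192 matrix shape are illustrative assumptions, not values taken from the abstract.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy illustration."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(W: Matrix, B: Matrix, A: Matrix,
                          alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A; W is frozen, only A and B are trained."""
    delta = matmul(B, A)
    scale = alpha / rank
    return [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Trainable parameters for one matrix: full fine-tuning vs. LoRA."""
    full = d_out * d_in              # every entry of W is trainable
    lora = rank * (d_in + d_out)     # B is d_out x rank, A is rank x d_in
    return full, lora


# Hypothetical shapes: one 8192 x 8192 projection with an assumed rank of 8.
full, lora = lora_param_counts(8192, 8192, 8)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")
```

Under these assumed shapes the adapter trains a small fraction of the parameters of the matrix it modifies, which is in line with the single-GPU, low-budget setup the abstract reports.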

