Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks that modify model weights. For example, recent works have demonstrated that refusal and unlearning safeguards can be trivially removed with a few steps of fine-tuning. These vulnerabilities necessitate new approaches for enabling the safe release of open-weight LLMs. We develop a method, called TAR, for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after thousands of steps of fine-tuning. In extensive evaluations and red teaming analyses, we find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that tamper-resistance is a tractable problem, opening up a promising new avenue to improve the safety and security of open-weight LLMs.