Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks that modify model weights. For example, recent works have demonstrated that refusal and unlearning safeguards can be trivially removed with a few steps of fine-tuning. These vulnerabilities necessitate new approaches for enabling the safe release of open-weight LLMs. We develop a method, called TAR, for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after thousands of steps of fine-tuning. In extensive evaluations and red teaming analyses, we find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that tamper-resistance is a tractable problem, opening up a promising new avenue to improve the safety and security of open-weight LLMs.