Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks that modify model weights. For example, recent works have demonstrated that refusal and unlearning safeguards can be trivially removed with a few steps of fine-tuning. These vulnerabilities necessitate new approaches for enabling the safe release of open-weight LLMs. We develop a method, called TAR, for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after thousands of steps of fine-tuning. In extensive evaluations and red teaming analyses, we find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that tamper-resistance is a tractable problem, opening up a promising new avenue to improve the safety and security of open-weight LLMs.