We propose patching large language models (LLMs) the way software is patched between versions: a lightweight, modular approach to addressing safety vulnerabilities. Although vendors release improved LLM versions, major releases are costly, infrequent, and difficult to tailor to customer needs, which leaves released models with known safety gaps. Unlike full-model fine-tuning or a major version update, our method enables rapid remediation by prepending a compact, learnable prefix to an existing model. This "patch" adds only 0.003% additional parameters, yet reliably steers model behavior toward that of a safer reference model. Across three critical domains (toxicity mitigation, bias reduction, and harmfulness refusal), these policy patches achieve safety improvements comparable to those of next-generation safety-aligned models while preserving fluency. Our results demonstrate that LLMs can be "patched" much like software, offering vendors and practitioners a practical mechanism for distributing scalable, efficient, and composable safety updates between major model releases.
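To make the prefix mechanism concrete, the following is a minimal sketch of what prepending a learnable prefix could look like. It is an illustration under stated assumptions, not the implementation behind these results: the `PrefixPatch` class, the `overhead_percent` helper, and all sizes are hypothetical, and the training step that steers the patched model toward the safer reference model is omitted.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PrefixPatch:
    """Hypothetical container for a learnable soft prefix (illustrative only)."""

    prefix_len: int   # number of prefix vectors (assumed value, not from the paper)
    hidden_dim: int   # embedding width of the base model (assumed value)
    weights: list = field(init=False)

    def __post_init__(self):
        # Small random initialisation; in practice these would be the only trained parameters.
        rng = random.Random(0)
        self.weights = [
            [rng.gauss(0.0, 0.02) for _ in range(self.hidden_dim)]
            for _ in range(self.prefix_len)
        ]

    @property
    def num_params(self) -> int:
        """Number of parameters the patch adds on top of the frozen base model."""
        return self.prefix_len * self.hidden_dim

    def apply(self, token_embeddings):
        """Prepend the prefix vectors to the input embeddings; the base model is untouched."""
        return self.weights + list(token_embeddings)


def overhead_percent(patch: PrefixPatch, base_params: int) -> float:
    """Patch size as a percentage of the base model's parameter count."""
    return 100.0 * patch.num_params / base_params


# Toy sizes chosen for readability, not taken from the paper.
patch = PrefixPatch(prefix_len=4, hidden_dim=8)
inputs = [[0.0] * 8 for _ in range(3)]   # three dummy token embeddings
patched = patch.apply(inputs)            # sequence is now prefix + tokens
```

Because the patch lives entirely in the prepended vectors, it can be shipped, applied, or removed without modifying the base model's weights, which is what makes the comparison to software patches natural.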