Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours: models should refuse to follow malicious instructions or give harmful advice (e.g. "how do I kill someone?"), but they should not refuse safe requests, even if they superficially resemble unsafe ones (e.g. "how do I kill a Python process?"). As prior work has shown, avoiding such false refusal is challenging even for highly capable language models. In this paper, we propose a simple and surgical method for mitigating false refusal in language models via single-vector ablation. For a given model, we extract a false refusal vector and show that ablating this vector reduces the false refusal rate without negatively impacting model safety or general model capabilities. We also show that our approach can be used for fine-grained calibration of model safety. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
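To make the core operation concrete, below is a minimal sketch of directional ablation: removing the component of a model's hidden states along a single vector. This is an illustration under assumptions, not the paper's exact implementation; the function name `ablate_direction`, the tensor shapes, and the note on how the vector might be estimated (e.g. as a difference of mean activations on contrasting prompt sets) are illustrative choices, not details taken from the abstract.

```python
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project out the component of `hidden` along `direction`: h' = h - (h . v_hat) v_hat."""
    v = direction / direction.norm()
    return hidden - (hidden @ v).unsqueeze(-1) * v

# Toy usage: ablate a hypothetical "false refusal" direction from a batch of
# hidden states with shape (batch, seq_len, d_model). In practice the vector
# would be extracted from the model itself (an assumption here), then ablated
# at inference time, e.g. via forward hooks on the residual stream.
hidden_states = torch.randn(2, 8, 512)
false_refusal_vec = torch.randn(512)
ablated = ablate_direction(hidden_states, false_refusal_vec)

# Sanity check: the ablated states have (numerically) zero component along the vector.
unit_vec = false_refusal_vec / false_refusal_vec.norm()
print((ablated @ unit_vec).abs().max())
```

Because the operation is a simple projection applied at inference time, it requires no gradient updates, which is consistent with the abstract's claim that the approach is training-free and model-agnostic.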