Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not harmlessness, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find exaggerated safety behaviours, where too much safety-tuning makes models refuse perfectly safe prompts if they superficially resemble unsafe ones. As a whole, our results illustrate trade-offs in training LLMs to be helpful and training them to be safe.
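The data mix described in the abstract (a small share of safety demonstrations added to an instruction-tuning set) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `mix_safety_examples`, the record layout, and the choice to measure the 3% relative to the size of the general instruction set are all assumptions made for illustration.

```python
import random


def mix_safety_examples(general, safety, fraction=0.03, seed=0):
    """Combine general instruction data with a small sample of safety data.

    Hypothetical helper, not taken from the paper. Assumption: `fraction`
    is measured relative to len(general), so fraction=0.03 adds roughly
    3 safety examples per 100 general examples.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    # Never request more safety examples than are available.
    k = min(len(safety), round(fraction * len(general)))
    mixed = list(general) + rng.sample(list(safety), k)
    # Shuffle so safety examples are spread throughout the training data.
    rng.shuffle(mixed)
    return mixed


# Illustrative placeholder records; real data would hold full demonstrations.
general_data = [{"source": "general", "id": i} for i in range(1000)]
safety_data = [{"source": "safety", "id": i} for i in range(100)]
training_data = mix_safety_examples(general_data, safety_data, fraction=0.03)
```

Varying `fraction` in a sketch like this is one way to probe the trade-off the abstract mentions: too few safety examples may leave a model unsafe, while too many may push it toward refusing safe prompts.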