Training large language models to follow instructions makes them perform better on a wide range of tasks and generally makes them more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models whose instruction-tuning emphasizes only helpfulness, not safety. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) to the training set when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find a behavior of exaggerated safety, where too much safety-tuning makes models refuse to respond to reasonable prompts that superficially resemble unsafe ones. Our study sheds light on trade-offs in training LLMs to follow instructions and exhibit safe behavior.
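To make the "3% safety examples" figure concrete, the following is a minimal sketch of how a small share of safety demonstrations might be mixed into an instruction-tuning set. It is not the procedure used in the paper: the function name `mix_in_safety_examples`, the record fields (`instruction`, `output`, `is_safety`), and the reading of 3% as a share of the final mixed set are all assumptions made for illustration.

```python
import random


def mix_in_safety_examples(instruction_data, safety_data,
                           safety_fraction=0.03, seed=0):
    """Return a shuffled training set in which about `safety_fraction`
    of the examples are safety demonstrations.

    Assumption: the fraction is measured against the final mixed set,
    not against the general instruction data alone.
    """
    if not 0 <= safety_fraction < 1:
        raise ValueError("safety_fraction must be in [0, 1)")

    n_general = len(instruction_data)
    # Solve n_safety / (n_general + n_safety) = safety_fraction.
    n_safety = round(safety_fraction * n_general / (1 - safety_fraction))
    if n_safety > len(safety_data):
        raise ValueError("not enough safety examples for this fraction")

    rng = random.Random(seed)
    # Sample without replacement from the safety pool, then shuffle.
    mixed = list(instruction_data) + rng.sample(list(safety_data), n_safety)
    rng.shuffle(mixed)
    return mixed


# Hypothetical toy data: 9,700 general examples and a pool of 1,000
# safety demonstrations (an unsafe request paired with a safe response).
general = [{"instruction": f"task {i}", "output": "...", "is_safety": False}
           for i in range(9700)]
safety_pool = [{"instruction": f"unsafe request {i}", "output": "safe response",
                "is_safety": True}
               for i in range(1000)]

train_set = mix_in_safety_examples(general, safety_pool)
n_safe = sum(example["is_safety"] for example in train_set)
print(len(train_set), n_safe, n_safe / len(train_set))
```

With these toy sizes, the mixed set holds 10,000 examples, 300 of them safety demonstrations, which matches the "a few hundred demonstrations" scale mentioned in the abstract. Raising `safety_fraction` is the knob that, per the abstract, risks exaggerated safety.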