Training large language models to follow instructions improves their performance on a wide range of tasks and generally makes them more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models whose instruction-tuning emphasizes only helpfulness, not safety. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) to the training set when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do observe exaggerated safety behavior, where too much safety-tuning makes models refuse to respond to reasonable prompts that superficially resemble unsafe ones. Our study sheds light on the trade-offs in training LLMs to follow instructions and exhibit safe behavior.
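To make the 3% figure concrete, the following is a minimal sketch of how a small set of safety demonstrations might be mixed into an instruction-tuning set. It is not the authors' pipeline: the function name `mix_safety_examples`, its parameters, and the reading of "3%" as a fraction of the final mixed set are all assumptions made for illustration.

```python
import random


def mix_safety_examples(instruction_data, safety_data, safety_fraction=0.03, seed=0):
    """Return a shuffled training set in which about `safety_fraction`
    of the examples are safety demonstrations.

    Hypothetical helper: the name, signature, and the interpretation of
    the fraction (share of the final mixed set) are assumptions.
    """
    if not 0 <= safety_fraction < 1:
        raise ValueError("safety_fraction must be in [0, 1)")

    # Solve n_safety / (n_instruction + n_safety) = safety_fraction for n_safety.
    n_safety = round(safety_fraction * len(instruction_data) / (1 - safety_fraction))
    # Never request more safety examples than are available.
    n_safety = min(n_safety, len(safety_data))

    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mixed = list(instruction_data) + rng.sample(list(safety_data), n_safety)
    rng.shuffle(mixed)
    return mixed
```

Under these assumptions, 9,700 general instruction examples would be combined with 300 safety demonstrations, which matches the "few hundred" scale mentioned above. The exaggerated-safety finding suggests treating `safety_fraction` as a quantity to tune and evaluate rather than one to maximize.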