Large language models often struggle with sensitive prompts: they may refuse outright, fall back on generic safety boilerplate, or fail to address legitimate informational needs that could be answered safely. We introduce SHARD, a self-reframing distillation method for improving safe helpfulness. SHARD first rewrites sensitive prompts to surface benign intent using philosophical guidelines, then reframes the model's original responses into safe, more helpful ones, and finally fine-tunes the model on its own self-reframed responses. Across DNA and the English subset of LINGUASAFE, SHARD improves helpfulness for most model families while preserving safety. It also remains competitive with distillation from a larger teacher model, suggesting that models can internalize safe and helpful behavior elicited from their own outputs. Warning: This paper contains content that may be offensive or harmful.
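The three stages described above can be sketched as a data-construction loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the `generate` callable, the instruction wording, the placeholder `GUIDELINES` string, and the choice to pair each reframed response with the original prompt are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps an instruction string to model text.
Generate = Callable[[str], str]

# Placeholder only; the actual philosophical guidelines are not given in the abstract.
GUIDELINES = "Assume the most plausible benign intent behind the request."


def build_self_reframed_dataset(
    generate: Generate,
    prompts: List[str],
    guidelines: str = GUIDELINES,
) -> List[Dict[str, str]]:
    """Build fine-tuning pairs from the model's own reframed responses."""
    examples = []
    for prompt in prompts:
        # The model's original response to the sensitive prompt.
        original = generate(prompt)

        # Stage 1: rewrite the prompt to surface benign intent.
        rewritten = generate(
            f"Guidelines: {guidelines}\n"
            f"Rewrite this prompt to surface its benign intent:\n{prompt}"
        )

        # Stage 2: reframe the original response into a safe, more helpful one.
        reframed = generate(
            f"Prompt: {rewritten}\n"
            f"Original response: {original}\n"
            "Rewrite the response so that it is safe and more helpful."
        )

        # Stage 3 input: assumed here to pair the original prompt with the
        # reframed response for supervised fine-tuning.
        examples.append({"prompt": prompt, "response": reframed})
    return examples
```

The resulting list of prompt and response pairs would then be passed to whatever supervised fine-tuning routine is in use. Because the same model plays every role, no larger teacher is needed to produce the training targets.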