A key component of building safe and reliable language models is enabling the models to appropriately refuse to follow certain instructions or answer certain questions. We may want models to output refusal messages for various categories of user queries, for example, ill-posed questions, instructions for committing illegal acts, or queries that require information beyond the model's knowledge cutoff. Engineering models that refuse to answer such questions is complicated by the fact that an individual may want their model to exhibit varying levels of sensitivity across refusal categories, and different users may want different refusal rates. The current default approach involves training multiple models with varying proportions of refusal messages from each category to achieve the desired refusal rates, which is computationally expensive and may require training a new model to accommodate each user's preferred refusal rates. To address these challenges, we propose refusal tokens, either one token per refusal category or a single shared refusal token, which are prepended to the model's responses during training. We then show how to increase or decrease the probability of generating the refusal token for each category at inference time to steer the model's refusal behavior. Refusal tokens enable controlling a single model's refusal rates without any further fine-tuning, requiring only selective intervention during generation.
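The inference-time steering described above can be sketched as a simple logit bias on the refusal token before sampling the first token of the response. This is a minimal, self-contained illustration, not the paper's implementation: the vocabulary, logits, token id, and function names below are all hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def steer_refusal(first_token_logits, refusal_token_id, bias):
    """Add `bias` to the [refuse] token's logit before sampling the
    first token. A positive bias raises the refusal rate; a negative
    bias lowers it. (Illustrative sketch, not the paper's code.)"""
    steered = list(first_token_logits)
    steered[refusal_token_id] += bias
    return softmax(steered)

# Hypothetical first-token logits over a toy 4-token vocabulary,
# where index 3 stands in for the special [refuse] token.
logits = [2.0, 1.0, 0.5, 1.5]
REFUSE = 3

p_base = softmax(logits)[REFUSE]
p_up = steer_refusal(logits, REFUSE, bias=2.0)[REFUSE]
p_down = steer_refusal(logits, REFUSE, bias=-2.0)[REFUSE]
assert p_down < p_base < p_up  # bias monotonically shifts refusal probability
```

Because the intervention touches only the first generated token, the same trained model can serve users with different preferred refusal rates by varying the bias per category at generation time.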