A key component of building safe and reliable language models is enabling the models to appropriately refuse to follow certain instructions or answer certain questions. We may want models to output refusal messages for various categories of user queries, for example, ill-posed questions, instructions for committing illegal acts, or queries that require information beyond the model's knowledge cutoff. Engineering models that refuse to answer such questions is complicated by the fact that an individual may want their model to exhibit varying levels of sensitivity across refusal categories, and different users may want different refusal rates. The current default approach involves training multiple models with varying proportions of refusal messages from each category to achieve the desired refusal rates, which is computationally expensive and may require training a new model to accommodate each user's preferred refusal rates. To address these challenges, we propose refusal tokens, either one token per refusal category or a single shared refusal token, which are prepended to the model's responses during training. We then show how to increase or decrease the probability of generating the refusal token for each category at inference time to steer the model's refusal behavior. Refusal tokens enable controlling a single model's refusal rates without any further fine-tuning, requiring only selective intervention during generation.
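The inference-time steering described above can be sketched as a simple logit bias on the refusal token before sampling the first token of the response. This is a minimal, self-contained illustration, not the paper's implementation: the vocabulary, logits, token id, and function names below are all hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def steer_refusal(first_token_logits, refusal_token_id, bias):
    """Add `bias` to the [refuse] token's logit before sampling the
    first token. A positive bias raises the refusal rate; a negative
    bias lowers it. (Illustrative sketch, not the paper's code.)"""
    steered = list(first_token_logits)
    steered[refusal_token_id] += bias
    return softmax(steered)

# Hypothetical first-token logits over a toy 4-token vocabulary,
# where index 3 stands in for the special [refuse] token.
logits = [2.0, 1.0, 0.5, 1.5]
REFUSE = 3

p_base = softmax(logits)[REFUSE]
p_up = steer_refusal(logits, REFUSE, bias=2.0)[REFUSE]
p_down = steer_refusal(logits, REFUSE, bias=-2.0)[REFUSE]
assert p_down < p_base < p_up  # bias monotonically shifts refusal probability
```

Because the intervention touches only the first generated token, the same trained model can serve users with different preferred refusal rates by varying the bias per category at generation time.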