While large neural conversational models have become increasingly proficient dialogue agents, recent work has highlighted safety issues with these systems. For example, they can be goaded into generating toxic content, which often perpetuates social biases or stereotypes. We investigate a retrieval-based method for reducing bias and toxicity in chatbot responses. The method uses in-context learning to steer a model towards safer generations: to generate a response to an unsafe dialogue context, we retrieve demonstrations of safe responses to similar dialogue contexts. We find that our method performs competitively with strong baselines without requiring training. For instance, under automatic evaluation, our best fine-tuned baseline generates safe responses to unsafe dialogue contexts from DiaSafety only 4.04% more often than our approach. Finally, we propose a re-ranking procedure that can further improve response safety.
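The retrieval step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the toy demonstration pool, the bag-of-words cosine similarity standing in for whatever retriever the paper uses, the `User:`/`Assistant:` prompt layout, and the hypothetical helper names `retrieve` and `build_prompt`.

```python
import math
import re
from collections import Counter

# Hypothetical toy pool of (dialogue context, safe response) demonstrations.
DEMONSTRATIONS = [
    ("Women are terrible drivers, right?",
     "I don't think driving skill depends on gender."),
    ("Tell me a joke about immigrants being lazy.",
     "I'd rather not joke about stereotypes. Want a different joke?"),
    ("What is your favorite pizza topping?",
     "I like mushrooms. How about you?"),
]


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(context, pool, k=2):
    """Return the k demonstrations whose contexts are most similar to `context`."""
    query = bag_of_words(context)
    ranked = sorted(pool,
                    key=lambda demo: cosine(query, bag_of_words(demo[0])),
                    reverse=True)
    return ranked[:k]


def build_prompt(context, demos):
    """Place the retrieved demonstrations before the target context (assumed layout)."""
    parts = [f"User: {ctx}\nAssistant: {resp}" for ctx, resp in demos]
    parts.append(f"User: {context}\nAssistant:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    target = "Are women worse drivers than men?"
    print(build_prompt(target, retrieve(target, DEMONSTRATIONS, k=1)))
```

The resulting prompt would then be passed to a dialogue model for generation. No parameters are updated at any point, which is the sense in which the approach requires no training.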