While large neural conversational models have become increasingly proficient dialogue agents, recent work has highlighted safety issues with these systems. For example, they can be goaded into generating toxic content, which often perpetuates social biases or stereotypes. We investigate a retrieval-based framework for reducing bias and toxicity in responses generated by neural chatbots. The framework uses in-context learning to steer a model towards safer generations. Concretely, to generate a response to an unsafe dialogue context, we retrieve demonstrations of safe model responses to similar dialogue contexts. We find that our proposed approach performs competitively with strong baselines that use fine-tuning. For instance, under automatic evaluation, our best fine-tuned baseline generates safe responses to unsafe dialogue contexts from DiaSafety only 2.92% more often than our approach. Finally, we also propose a straightforward re-ranking procedure that can further improve response safety.
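The retrieve-then-prompt idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method's actual implementation: the demonstration pool is invented, the bag-of-words cosine similarity stands in for whatever retriever is really used, the `User:`/`Bot:` prompt template is hypothetical, and `rerank` simply picks the candidate that a caller-supplied safety scorer rates highest.

```python
import math
import re
from collections import Counter

# Hypothetical demonstration pool: (dialogue context, safe response) pairs.
DEMONSTRATIONS = [
    ("People from that country are all criminals.",
     "I don't think it's fair to judge a whole group by the actions of a few."),
    ("Women are bad at math, right?",
     "I disagree; ability in math has nothing to do with gender."),
    ("What's your favorite pizza topping?",
     "I like mushrooms. How about you?"),
]


def bag_of_words(text):
    """Lowercased token counts; a stand-in for a learned text encoder."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(context, pool, k=2):
    """Return the k demonstrations whose contexts are most similar to `context`."""
    query = bag_of_words(context)
    ranked = sorted(pool,
                    key=lambda pair: cosine(query, bag_of_words(pair[0])),
                    reverse=True)
    return ranked[:k]


def build_prompt(context, demos):
    """Place the retrieved safe demonstrations before the target context."""
    blocks = [f"User: {c}\nBot: {r}" for c, r in demos]
    blocks.append(f"User: {context}\nBot:")
    return "\n\n".join(blocks)


def rerank(candidates, safety_score):
    """Pick the candidate response that the supplied scorer rates safest."""
    return max(candidates, key=safety_score)
```

In use, the prompt returned by `build_prompt(context, retrieve(context, DEMONSTRATIONS))` would be passed to the dialogue model, several candidate responses would be sampled, and `rerank` would select among them using a safety classifier of the reader's choosing.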