Generative, open-domain conversational models are particularly prone to producing unsafe content because they are trained on web-based social data. Prior mitigation approaches have drawbacks: they disrupt the flow of conversation, generalize poorly to unseen toxic input contexts, or sacrifice dialogue quality for the sake of safety. In this paper, we present a novel framework, "LOT" (Learn NOT to), that employs a contrastive loss to improve generalization by learning from both positive and negative training signals. Unlike the standard contrastive learning framework, our approach obtains its positive and negative signals automatically from previously learned safe and unsafe language distributions. The LOT framework uses divergence to steer generations away from the unsafe subspace and toward the safe subspace while sustaining the flow of conversation. Our approach is memory- and time-efficient during decoding and effectively reduces toxicity while preserving engagingness and fluency. Empirical results indicate that LOT reduces toxicity by up to four-fold while achieving four- to six-fold higher rates of engagingness and fluency than baseline models. Human evaluation further corroborates these findings.
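To make the idea of a divergence-based contrastive signal concrete, the following is a minimal sketch, not the loss used in LOT, whose exact form the abstract does not specify. It assumes three next-token distributions over a toy vocabulary (the model's, a safe reference, and an unsafe reference), uses the Jensen-Shannon divergence as the distance, and combines the two distances in a triplet-style hinge. The function names, the `margin` parameter, and the toy distributions are all hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def contrastive_divergence_loss(model, safe, unsafe, margin=0.1):
    """Hypothetical hinge loss (an assumption, not the paper's objective).

    The loss is zero once the model distribution is closer to the safe
    reference than to the unsafe reference by at least `margin`.
    """
    d_safe = js_divergence(model, safe)      # positive signal: pull toward
    d_unsafe = js_divergence(model, unsafe)  # negative signal: push away
    return max(0.0, margin + d_safe - d_unsafe)


# Toy next-token distributions over a three-word vocabulary (illustrative only).
safe_ref = [0.7, 0.2, 0.1]
unsafe_ref = [0.1, 0.2, 0.7]

near_safe = [0.6, 0.3, 0.1]
near_unsafe = [0.1, 0.3, 0.6]

loss_near_safe = contrastive_divergence_loss(near_safe, safe_ref, unsafe_ref)
loss_near_unsafe = contrastive_divergence_loss(near_unsafe, safe_ref, unsafe_ref)
```

In this sketch, a model distribution that already sits near the safe reference incurs no penalty, while one near the unsafe reference is penalized in proportion to how much closer it is to the unsafe side. A bounded, symmetric divergence is chosen here only to keep the toy example numerically well behaved.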