Recent advancements in open-domain dialogue systems have been propelled by the emergence of high-quality large language models (LLMs) and various effective training methodologies. Nevertheless, the presence of toxicity within these models presents a significant challenge that can potentially diminish the user experience. In this study, we introduce an innovative training algorithm, an improvement upon direct preference optimization (DPO), called adversarial DPO (ADPO). The ADPO algorithm is designed to train models to assign higher probability distributions to preferred responses and lower distributions to unsafe responses, which are self-generated using the toxic control token. We demonstrate that ADPO enhances the model's resilience against harmful conversations while minimizing performance degradation. Furthermore, we illustrate that ADPO offers a more stable training procedure compared to the traditional DPO. To the best of our knowledge, this is the first adaptation of the DPO algorithm that directly incorporates harmful data into the generative model, thereby reducing the need to artificially create safe dialogue data.
翻译:近期开放域对话系统的进展得益于高质量大语言模型(LLMs)的涌现及多种有效训练方法的提出。然而,这些模型中的毒性问题仍构成显著挑战,可能降低用户体验。本研究提出一种创新训练算法——对抗性直接偏好优化(ADPO),作为对直接偏好优化(DPO)的改进。ADPO算法通过训练模型为偏好响应分配更高概率分布、为不安全响应分配更低概率分布(后者通过毒性控制令牌自生成),增强模型对抗有害对话的鲁棒性,同时最小化性能退化。进一步地,我们证明相较于传统DPO,ADPO提供了更稳定的训练流程。据我们所知,这是首个将有害数据直接融入生成模型的DPO变体算法,从而减少人工创建安全对话数据的需求。