Large Language Models (LLMs) generating unsafe responses to toxic prompts is a significant issue in their applications. While various efforts aim to address this safety concern, previous approaches often demand substantial human data collection or rely on the less dependable option of using another LLM to generate corrective data. In this paper, we address this problem while overcoming the limitation of requiring substantial high-quality human data. Our method requires only a small set of unsafe responses to toxic prompts, which is easily obtained from the unsafe LLM itself. By employing a semantic cost combined with a negative Earth Mover's Distance (EMD) loss, we guide the LLM away from generating unsafe responses. Additionally, we propose a novel lower bound on the EMD loss that enables more efficient optimization. Our results demonstrate superior performance and data efficiency compared to baselines, and we further examine the nuanced effects of over-alignment and the potential degradation of language capabilities when using contrastive data.
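To make the negative-EMD idea concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it uses the classical centroid lower bound on EMD under a Euclidean ground cost over token embeddings (by Jensen's inequality, EMD(p, q) ≥ ‖E_p[e] − E_q[e]‖), which may differ from the lower bound proposed in the paper. The helper names (`emd_lower_bound`, `negative_emd_penalty`, `unsafe_token_ids`) are hypothetical and chosen for illustration only.

```python
import torch
import torch.nn.functional as F

def emd_lower_bound(p, q, embeddings):
    """Centroid lower bound on the Earth Mover's Distance.

    For a Euclidean ground cost over token embeddings e,
    EMD(p, q) >= || E_p[e] - E_q[e] ||.
    p, q: probability distributions over the vocabulary, shape [V].
    embeddings: token embedding matrix, shape [V, d].
    """
    mean_p = p @ embeddings  # expected embedding under the model distribution
    mean_q = q @ embeddings  # expected embedding under the unsafe reference
    return torch.linalg.norm(mean_p - mean_q)

def negative_emd_penalty(logits, unsafe_token_ids, embeddings, weight=1.0):
    """Illustrative negative-EMD style penalty (hypothetical helper).

    Builds an empirical reference distribution from tokens of unsafe
    responses and rewards moving the model's next-token distribution
    away from it: maximizing the (lower-bounded) EMD is expressed as
    minimizing its negative.
    """
    p_model = F.softmax(logits, dim=-1)
    q_unsafe = torch.bincount(
        unsafe_token_ids, minlength=logits.shape[-1]
    ).float()
    q_unsafe = q_unsafe / q_unsafe.sum()
    return -weight * emd_lower_bound(p_model, q_unsafe, embeddings)
```

In a sketch like this, the closed-form lower bound avoids solving the full optimal-transport problem at every training step, which is the kind of efficiency gain a lower bound on the EMD loss is meant to provide.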