Large language models (LLMs) have revolutionized the role of AI, yet they also risk propagating unethical content. Alignment technologies, which steer LLMs towards human preferences, have been introduced in response and are attracting increasing attention. Despite notable breakthroughs in this direction, existing methods rely heavily on high-quality positive-negative training pairs, and therefore suffer from noisy labels and from only marginal distinctions between preferred and dispreferred responses. Given recent LLMs' proficiency in generating helpful responses, this work pivots towards a new research focus: achieving alignment using solely human-annotated negative samples, preserving helpfulness while reducing harmfulness. For this purpose, we propose Distributional Dispreference Optimization (D$^2$O), which maximizes the discrepancy between the generated responses and the dispreferred ones to effectively eschew harmful information. We theoretically demonstrate that D$^2$O is equivalent to learning a distributional, rather than instance-level, preference model that reflects human dispreference against the distribution of negative responses. In addition, D$^2$O integrates an implicit Jeffreys divergence regularization that balances the exploitation and exploration of reference policies and converges to a non-negative value during training. Extensive experiments demonstrate that our method achieves comparable generation quality and surpasses the latest baselines in producing less harmful and more informative responses, with better training stability and faster convergence.
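For readers unfamiliar with the Jeffreys divergence mentioned above, the following is a minimal sketch of its standard textbook definition for discrete distributions, $J(P, Q) = \mathrm{KL}(P \,\|\, Q) + \mathrm{KL}(Q \,\|\, P)$. It is not the implicit regularizer derived for D$^2$O, whose exact form the abstract does not give; the function names and the toy distributions are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length sequences.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jeffreys_divergence(p, q):
    """Symmetrized KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical toy distributions over three outcomes (illustration only).
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

j_pq = jeffreys_divergence(p, q)
j_qp = jeffreys_divergence(q, p)
j_pp = jeffreys_divergence(p, p)
```

Because each KL term is non-negative, the sum is also non-negative, and it is zero when the two distributions coincide. Unlike a single KL term, it is symmetric in its arguments.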