Aligning Large Language Model (LLM) responses with human preferences is vital for building safe and controllable AI systems. While preference optimization methods based on Plackett-Luce (PL) and Bradley-Terry (BT) models have shown promise, they face challenges such as poor handling of harmful content, inefficient use of dispreferred responses, and, specifically for PL, high computational costs. To address these issues, we propose Hard Preference Sampling (HPS), a novel framework for robust and efficient human preference alignment. HPS introduces a training loss that prioritizes the most preferred response while rejecting all dispreferred and harmful ones. It emphasizes "hard" dispreferred responses -- those closely resembling preferred ones -- to enhance the model's rejection capabilities. By leveraging a single-sample Monte Carlo sampling strategy, HPS reduces computational overhead while maintaining alignment quality. Theoretically, HPS improves sample efficiency over existing PL methods and maximizes the reward margin between preferred and dispreferred responses, ensuring clearer distinctions. Experiments on HH-RLHF and PKU-Safety datasets validate HPS's effectiveness, achieving comparable BLEU and reward scores while greatly improving reward margins and thus reducing harmful content generation.
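To make the ideas in the abstract concrete, the following is a minimal illustrative sketch, not the paper's actual loss. It assumes scalar reward scores for one preferred response and several dispreferred ones, and uses a hypothetical hardness weighting (a softmax over dispreferred rewards with an assumed temperature `beta`) so that dispreferred responses scoring close to the preferred one contribute more. The function names `hardness_weights`, `full_loss`, and `sampled_loss` are invented for illustration. The single-sample variant draws one dispreferred response from the hardness distribution, which gives an unbiased estimate of the weighted negative term (though not of the log loss itself).

```python
import math
import random


def hardness_weights(neg_rewards, beta=1.0):
    """Assumed weighting: softmax over dispreferred rewards.

    Higher-reward ("harder") dispreferred responses get more weight.
    beta is a hypothetical temperature; beta=0 gives uniform weights.
    """
    m = max(neg_rewards)  # subtract the max for numerical stability
    exps = [math.exp(beta * (r - m)) for r in neg_rewards]
    z = sum(exps)
    return [e / z for e in exps]


def full_loss(pos_reward, neg_rewards, beta=1.0):
    """Softmax-style loss using every dispreferred response.

    Equals -log(exp(r+) / (exp(r+) + sum_i n*q_i*exp(r_i))),
    where q is the hardness distribution and n the number of negatives.
    """
    q = hardness_weights(neg_rewards, beta)
    n = len(neg_rewards)
    neg_term = sum(n * qi * math.exp(r) for qi, r in zip(q, neg_rewards))
    return math.log1p(neg_term * math.exp(-pos_reward))


def sampled_loss(pos_reward, neg_rewards, beta=1.0, rng=random):
    """Single-sample Monte Carlo variant of full_loss.

    Draws one dispreferred index j from q and replaces the weighted sum
    with n*exp(r_j), whose expectation under q equals that sum.
    """
    q = hardness_weights(neg_rewards, beta)
    n = len(neg_rewards)
    j = rng.choices(range(n), weights=q, k=1)[0]
    neg_term = n * math.exp(neg_rewards[j])
    return math.log1p(neg_term * math.exp(-pos_reward))


if __name__ == "__main__":
    pos = 1.2                 # reward of the preferred response (made-up value)
    negs = [1.0, 0.1, -0.8]   # rewards of dispreferred responses (made-up values)
    print("weights:", hardness_weights(negs, beta=2.0))
    print("full loss:", full_loss(pos, negs, beta=2.0))
    print("sampled loss:", sampled_loss(pos, negs, beta=2.0, rng=random.Random(0)))
```

In this sketch the sampled variant evaluates only one dispreferred response per step instead of all of them, which is the kind of cost reduction the abstract attributes to single-sample Monte Carlo; the exact weighting and estimator used by HPS may differ.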