Preference datasets are essential for incorporating human preferences into pre-trained language models, and they play a key role in the success of Reinforcement Learning from Human Feedback (RLHF). However, these datasets often exhibit conflicting alignment objectives, which increases vulnerability to jailbreak attacks and makes it difficult to adapt to downstream tasks that prioritize some alignment objectives without negatively impacting others. In this work, we introduce a novel statistical metric, Alignment Dimension Conflict, to quantify the degree of conflict within preference datasets. We then present \texttt{Hummer} and its fine-grained variant, \texttt{Hummer-F}, innovative pairwise preference datasets with reduced-conflict alignment objectives. \texttt{Hummer} is built on UltraFeedback and enhanced with AI feedback from GPT-4, making it the first preference dataset designed to reduce competition between alignment objectives. Furthermore, we develop the reward models HummerRM and HummerRM-F, which employ a hybrid sampling approach to balance diverse alignment objectives effectively. This sampling strategy positions HummerRM as an ideal model for domain-specific further fine-tuning while reducing vulnerability to attacks.
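To make the hybrid sampling idea concrete, the following is a minimal, illustrative sketch only: the abstract does not specify the sampling mechanics, so reading "hybrid sampling" as drawing each training batch from a weighted mixture over alignment dimensions is an assumption, and the per-dimension pools, dimension names, and uniform weights below are placeholders rather than the released \texttt{Hummer} schema or the HummerRM training code.

\begin{verbatim}
import random

# Hypothetical illustration: "hybrid sampling" is read here as drawing each
# training batch from a weighted mixture over alignment dimensions, so that
# no single objective (e.g., helpfulness) dominates reward-model updates.
# Dataset layout and dimension names are placeholders, not the Hummer schema.

def hybrid_sample(pairs_by_dim, weights, batch_size,
                  rng=random.Random(0)):
    """Draw (chosen, rejected) pairs from a weighted mixture of
    per-dimension preference pools."""
    dims = list(pairs_by_dim)
    w = [weights[d] for d in dims]
    batch = []
    for _ in range(batch_size):
        # Pick an alignment dimension according to the mixture weights,
        # then sample one preference pair from that dimension's pool.
        dim = rng.choices(dims, weights=w, k=1)[0]
        batch.append(rng.choice(pairs_by_dim[dim]))
    return batch

# Toy usage with placeholder data; uniform weights balance the objectives.
pools = {
    "helpfulness":  [("resp_a", "resp_b")] * 100,
    "harmlessness": [("resp_c", "resp_d")] * 100,
    "honesty":      [("resp_e", "resp_f")] * 100,
}
batch = hybrid_sample(pools, {d: 1.0 for d in pools}, batch_size=8)
print(len(batch), "pairs sampled")
\end{verbatim}

Under this reading, adjusting the mixture weights would let a practitioner up-weight one alignment objective for a downstream task while still drawing some gradient signal from the others.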