Preference datasets are essential for incorporating human preferences into pre-trained language models, and they play a key role in the success of Reinforcement Learning from Human Feedback (RLHF). However, these datasets often exhibit conflicting alignment objectives, which increases vulnerability to jailbreak attacks and makes it difficult to adapt to downstream tasks that prioritize some alignment objectives without negatively impacting others. In this work, we introduce a novel statistical metric, Alignment Dimension Conflict, to quantify the degree of conflict within preference datasets. We then present \texttt{Hummer} and its fine-grained variant, \texttt{Hummer-F}, innovative pairwise preference datasets with reduced-conflict alignment objectives. \texttt{Hummer} is built on UltraFeedback and enhanced with AI feedback from GPT-4, making it the first preference dataset designed to reduce competition between alignment objectives. Furthermore, we develop the reward models HummerRM and HummerRM-F, which employ a hybrid sampling approach to balance diverse alignment objectives effectively. This sampling strategy positions HummerRM as an ideal model for domain-specific further fine-tuning while reducing vulnerability to attacks.
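To make the hybrid sampling idea concrete, the following is a minimal, illustrative sketch only: the abstract does not specify the sampling mechanics, so reading "hybrid sampling" as drawing each training batch from a weighted mixture over alignment dimensions is an assumption, and the per-dimension pools, dimension names, and uniform weights below are placeholders rather than the released \texttt{Hummer} schema or the HummerRM training code.

\begin{verbatim}
import random

# Hypothetical illustration: "hybrid sampling" is read here as drawing each
# training batch from a weighted mixture over alignment dimensions, so that
# no single objective (e.g., helpfulness) dominates reward-model updates.
# Dataset layout and dimension names are placeholders, not the Hummer schema.

def hybrid_sample(pairs_by_dim, weights, batch_size,
                  rng=random.Random(0)):
    """Draw (chosen, rejected) pairs from a weighted mixture of
    per-dimension preference pools."""
    dims = list(pairs_by_dim)
    w = [weights[d] for d in dims]
    batch = []
    for _ in range(batch_size):
        # Pick an alignment dimension according to the mixture weights,
        # then sample one preference pair from that dimension's pool.
        dim = rng.choices(dims, weights=w, k=1)[0]
        batch.append(rng.choice(pairs_by_dim[dim]))
    return batch

# Toy usage with placeholder data; uniform weights balance the objectives.
pools = {
    "helpfulness":  [("resp_a", "resp_b")] * 100,
    "harmlessness": [("resp_c", "resp_d")] * 100,
    "honesty":      [("resp_e", "resp_f")] * 100,
}
batch = hybrid_sample(pools, {d: 1.0 for d in pools}, batch_size=8)
print(len(batch), "pairs sampled")
\end{verbatim}

Under this reading, adjusting the mixture weights would let a practitioner up-weight one alignment objective for a downstream task while still drawing some gradient signal from the others.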