Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. They are trained using preference datasets where each example consists of one input prompt, two responses, and a preference label. Because curating a high-quality human-labeled preference dataset is both time-consuming and expensive, practitioners often rely on existing powerful LLMs to generate preference labels. This can introduce noise and impede RM training. In this work, we present RMBoost, a novel synthetic preference data generation paradigm to boost reward model quality. Unlike traditional methods, which generate two responses before obtaining the preference label, RMBoost first generates one response and selects a preference label, and then generates the second, more (or less) preferred response conditioned on the pre-selected preference label and the first response. This approach offers two main advantages. First, RMBoost reduces labeling noise since preference pairs are constructed intentionally rather than judged after the fact. Second, RMBoost facilitates the creation of more diverse responses by incorporating various quality aspects (e.g., helpfulness, relevance, completeness) into the prompts. We conduct extensive experiments across three diverse datasets and demonstrate that RMBoost outperforms other synthetic preference data generation techniques and significantly boosts the performance of four distinct reward models.
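The generation flow described above can be illustrated with a minimal sketch. The `call_llm` helper, the prompt templates, and the aspect list are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the RMBoost-style generation flow (illustrative only).
# `call_llm` is a hypothetical placeholder for a call to a powerful LLM.
import random

QUALITY_ASPECTS = ["helpfulness", "relevance", "completeness"]


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., via an API); not implemented here."""
    raise NotImplementedError


def generate_preference_example(input_prompt: str) -> dict:
    # Step 1: generate the first response as usual.
    response_1 = call_llm(f"Answer the following prompt:\n{input_prompt}")

    # Step 2: pre-select the preference label instead of judging afterwards.
    label = random.choice(["better", "worse"])

    # Step 3: generate the second response conditioned on the first response,
    # the pre-selected label, and a sampled quality aspect to encourage diversity.
    aspect = random.choice(QUALITY_ASPECTS)
    response_2 = call_llm(
        f"Prompt: {input_prompt}\n"
        f"Existing response: {response_1}\n"
        f"Write a response that is {label} than the existing one, "
        f"differing mainly in {aspect}."
    )

    # The preference label is known by construction, so no post hoc judging is needed.
    chosen, rejected = (
        (response_2, response_1) if label == "better" else (response_1, response_2)
    )
    return {"prompt": input_prompt, "chosen": chosen, "rejected": rejected}
```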