Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate our methods' effectiveness and scalability. Results show that RTV is most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.
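To make the hybrid reward idea concrete, here is a minimal sketch of how per-prompt routing between the two reward sources could look, assuming a programmatic verifier is attached only to reasoning prompts (e.g., unit tests for code, exact-match checks for math) and a generative reward model scores everything else. The names (Prompt, hybrid_reward, genrm_score, verifier) are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch of a hybrid reward: use a reasoning-task verifier (RTV)
# when the prompt carries a programmatic checker, otherwise fall back to a
# generative reward model (GenRM). Names and signatures are assumptions made
# for illustration only.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Prompt:
    text: str
    # Deterministic checker (unit tests, exact-match grading); None for
    # open-ended prompts that must be scored by the GenRM instead.
    verifier: Optional[Callable[[str], bool]] = None
    reference: Optional[str] = None  # ground truth or SFT Best-of-N reference


def genrm_score(prompt: Prompt, response: str) -> float:
    """Placeholder for a generative reward model that compares the response
    against prompt.reference and returns a scalar in [0, 1]."""
    return 0.5  # dummy value; a real GenRM would score the response here


def hybrid_reward(prompt: Prompt, response: str) -> float:
    if prompt.verifier is not None:
        # RTV path: binary, hard-to-hack reward from a programmatic check.
        return 1.0 if prompt.verifier(response) else 0.0
    # GenRM path: model-based score, more susceptible to reward hacking.
    return genrm_score(prompt, response)


# Example: a math prompt with an exact-match verifier.
p = Prompt(text="What is 17 * 3?", verifier=lambda r: r.strip() == "51")
print(hybrid_reward(p, "51"))  # -> 1.0
```

The routing choice mirrors the finding reported above: verifier-backed rewards resist hacking best, so they are preferred whenever a checkable ground truth exists, with the GenRM reserved for prompts that cannot be verified programmatically.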