The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Many benchmarks have been built to evaluate reward models across different domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this gap, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed that reward modeling performance correlates with both the position of the error within a response and the overall response length, with distinct characteristics observed for classifiers and generative models. Finally, we demonstrate that classifiers generalize better than generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for visualizing progress in this crucial area.