IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning

Generative Reward Models (GRMs) have attracted considerable research interest in reward modeling due to their interpretability, inference-time scalability, and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck when integrated with RL algorithms such as Group Relative Policy Optimization (GRPO). This bottleneck arises from two factors: (i) the O(n^2) time complexity of pairwise comparisons required to obtain relative scores, and (ii) the computational overhead of repeated sampling or additional chain-of-thought (CoT) reasoning to improve performance. To address the first factor, we propose Intergroup Relative Preference Optimization (IRPO), a novel RL framework that incorporates the well-established Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO enables efficient evaluation of arbitrarily many candidates during RL training while preserving interpretability and fine-grained reward signals. Experimental results demonstrate that IRPO achieves state-of-the-art (SOTA) performance among pointwise GRMs across multiple benchmarks, with performance comparable to that of current leading pairwise GRMs. Furthermore, we show that IRPO significantly outperforms pairwise GRMs in post-training evaluations.
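To make the cost argument concrete, the following is a minimal sketch of the standard Bradley-Terry model applied to pointwise scores. It is not the authors' implementation: the abstract does not specify how IRPO combines Bradley-Terry with GRPO, and the function names `bt_preference`, `pairwise_comparison_count`, and `rank_by_score` are hypothetical names chosen for illustration. The sketch only shows why one scalar score per response suffices to recover any pairwise preference, so n candidates need n scoring calls rather than n(n-1)/2 comparisons.

```python
import math
from typing import List


def bt_preference(score_a: float, score_b: float) -> float:
    """Standard Bradley-Terry probability that A is preferred over B.

    P(A > B) = exp(s_A) / (exp(s_A) + exp(s_B)) = sigmoid(s_A - s_B)
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def pairwise_comparison_count(n: int) -> int:
    """Number of unordered pairs a pairwise judge must compare: n(n-1)/2."""
    return n * (n - 1) // 2


def rank_by_score(scores: List[float]) -> List[int]:
    """Rank candidates (best first) using only their pointwise scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Illustrative scores for three candidate responses (made-up values).
    scores = [0.2, 1.5, -0.3]
    print("ranking (best first):", rank_by_score(scores))
    print("P(response 1 > response 0):", bt_preference(scores[1], scores[0]))
    # With 8 candidates: 8 pointwise scoring calls vs. 28 pairwise comparisons.
    print("pairs for n=8:", pairwise_comparison_count(8))
```

Because the preference probability depends only on the score difference, any pair of candidates can be compared after the fact without another model call, which is the property the pointwise formulation relies on.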
