Current reinforcement learning from human feedback (RLHF) methods primarily rely on scalar rewards from a trained reward model (RM). While effective, scalar rewards are often noisy and fail to capture fine-grained preference differences, whereas RM hidden states encode richer semantic and preference information. We introduce representation-aware advantage estimation, which uses RM hidden states as auxiliary signals for better advantage estimation. Specifically, we propose Graph-based Advantage Estimation (GraphAE), which treats each sampled group as a graph whose nodes correspond to responses and whose edges capture their similarity in the RM hidden space. Advantages are then computed via graph propagation, enabling each sample to incorporate contextual information from its neighbors. GraphAE is lightweight and can be seamlessly integrated into existing group-based RL algorithms. We apply GraphAE to GRPO, GSPO, and RLOO, and conduct extensive experiments on different models and benchmarks. Empirical results show consistent improvements across three benchmarks, with gains of up to +6.3 on Arena-Hard-v0.1, +8.27 on AlpacaEval 2.0, and +0.22 on MT-Bench. These results demonstrate that leveraging RM representations leads to more sample-efficient and robust RLHF.
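To make the idea of graph propagation over a sampled group concrete, here is a minimal sketch in plain Python. The abstract does not specify the similarity measure, the propagation rule, or any hyperparameters, so everything specific below is an assumption made for illustration: cosine similarity for edge weights, a softmax with a `temperature` to normalize them, a single propagation step with mixing weight `alpha`, and GRPO-style mean/standard-deviation normalization at the end. The function names `cosine` and `graph_advantages` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def graph_advantages(rewards, hidden, alpha=0.5, temperature=0.1, eps=1e-8):
    """Illustrative sketch only (not the paper's exact formulation).

    rewards: scalar RM rewards, one per response in the sampled group
    hidden:  RM hidden-state vectors, one per response
    alpha:   assumed mixing weight between a node's own reward and the
             similarity-weighted average of its neighbors' rewards
    """
    n = len(rewards)
    smoothed = []
    for i in range(n):
        neighbors = [j for j in range(n) if j != i]
        if not neighbors:
            # A group of one has no edges; keep the raw reward.
            smoothed.append(rewards[i])
            continue
        # Assumed edge weights: softmax over scaled cosine similarities.
        logits = [cosine(hidden[i], hidden[j]) / temperature for j in neighbors]
        top = max(logits)
        weights = [math.exp(x - top) for x in logits]
        total = sum(weights)
        neighbor_avg = sum(w * rewards[j] for w, j in zip(weights, neighbors)) / total
        # One propagation step: blend own reward with neighbor information.
        smoothed.append((1 - alpha) * rewards[i] + alpha * neighbor_avg)
    # Group normalization in the style of GRPO: zero mean, unit variance.
    mean = sum(smoothed) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in smoothed) / n)
    return [(s - mean) / (std + eps) for s in smoothed]


# Toy group: responses 0 and 2 are close in hidden space, as are 1 and 3.
rewards = [1.0, 0.2, 0.9, 0.1]
hidden = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
advantages = graph_advantages(rewards, hidden)
```

With `alpha=0.0` the sketch reduces to ordinary group-normalized advantages, which is one way to see how a graph-based estimator could slot into a group-based algorithm without changing the rest of the training loop. Larger `alpha` pulls each response's signal toward that of its nearest neighbors in the hidden space, which is the intuition behind reducing the effect of noisy scalar rewards.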