In this paper, we explore how directly pretraining a value model simplifies and stabilizes reinforcement learning from human feedback (RLHF). In reinforcement learning, value estimation, distinct from reward supervision, is the key to policy optimization. The value function predicts the \emph{return-to-go} of a partial answer, that is, how promising the partial answer is if continued to completion. In RLHF, however, the standard pipeline first pretrains a reward model and then learns a value function online, even though no new reward signals become available once preference data has been collected. This makes critic learning redundant: training a reward model and then deriving a value model from it is informationally equivalent to directly pretraining a value model. Importantly, this requires no additional supervision; our value model is trained on exactly the same data used for reward modeling. Building on this insight, we introduce \emph{Decoupled Value Policy Optimization} (DVPO), a framework that pretrains a \emph{Global Value Model} (GVM) offline and freezes it as a universal critic for policy learning. The GVM provides stable, fine-grained credit assignment without critic drift or additional trajectory sampling. Experiments on MT-Bench, Alpaca-Eval, and Arena-Hard demonstrate that DVPO matches or surpasses state-of-the-art RLHF methods. These results suggest that RLHF can be reframed as policy-only optimization guided by a single pretrained value model.
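To illustrate the kind of credit assignment a frozen value model enables, the following is a minimal sketch, not the paper's implementation: it assumes a hypothetical frozen GVM has already scored each prefix of a response with its return-to-go, and derives a per-token advantage as the one-step temporal difference between successive prefix values, with the terminal reward applied at the final token.

```python
# Sketch of per-token advantages from a frozen value model's prefix scores.
# `values` and `reward` are hypothetical inputs, not DVPO's actual interface.
import numpy as np

def token_advantages(values, reward, gamma=1.0):
    """Compute A_t = r_t + gamma * V(s_{t+1}) - V(s_t) for each token.

    values[t] is the frozen critic's return-to-go estimate for the prefix
    ending at token t; the scalar reward is granted only at the final token.
    """
    values = np.asarray(values, dtype=float)
    T = len(values)
    adv = np.zeros(T)
    for t in range(T - 1):
        # Intermediate tokens: reward is zero, so only the value change counts.
        adv[t] = gamma * values[t + 1] - values[t]
    # Final token: terminal reward minus the last prefix's value estimate.
    adv[T - 1] = reward - values[T - 1]
    return adv

# Example: three prefix scores from a (stubbed) frozen GVM, terminal reward 1.0.
vals = [0.1, 0.3, 0.6]
adv = token_advantages(vals, reward=1.0)
```

Because the critic is pretrained and frozen, these advantages stay fixed throughout policy training, which is what removes critic drift from the optimization loop.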