Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the training of Large Language Models (LLMs) using Actor-Critic methods (e.g., PPO), this baseline is typically estimated by a Value Model (Critic), often as large as the policy model itself. However, as the policy continuously evolves, the value model requires expensive, synchronous incremental training to accurately track the shifting capabilities of the policy. To avoid this overhead, Group Relative Policy Optimization (GRPO) eliminates the coupled value model by using the average reward of a group of rollouts as the baseline; yet this approach necessitates extensive sampling to maintain estimation stability. In this paper, we propose $V_0$, a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts without requiring parameter updates. We reframe value estimation by treating the policy's dynamic capability as an explicit context input; specifically, we leverage a history of instruction-performance pairs to dynamically profile the model, departing from the traditional paradigm that relies on parameter fitting to perceive capability shifts. Focusing on value estimation at State Zero (i.e., the initial prompt, hence $V_0$), our model serves as a critical resource scheduler. During GRPO training, $V_0$ predicts success rates prior to rollout, allowing for efficient sampling budget allocation; during deployment, it functions as a router, dispatching instructions to the most cost-effective and suitable model. Empirical results demonstrate that $V_0$ significantly outperforms heuristic budget allocation and achieves a Pareto-optimal trade-off between performance and cost in LLM routing tasks.
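To make the group-relative baseline concrete, the following is a minimal sketch of how GRPO replaces the learned critic with the group-average reward. It assumes scalar rewards and the common mean/std normalization; function names and the standard-deviation normalization are illustrative, not the paper's exact formulation.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one prompt's rollouts:
    each reward minus the group mean, scaled by the group std.
    The group mean plays the role of the value-model baseline."""
    r = np.asarray(rewards, dtype=float)
    baseline = r.mean()            # group-average reward as baseline
    return (r - baseline) / (r.std() + eps)

# Example: 4 rollouts for a single prompt with binary success rewards.
adv = grpo_advantages([1.0, 0.0, 1.0, 1.0])
# The failed rollout receives a negative advantage; successes, positive.
```

Because the baseline is estimated from the sampled group itself, a small group yields a noisy baseline, which is the sampling-cost issue $V_0$ targets by predicting success rates before rollout.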