Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts using a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a single pre-specified scalar reward. This often leads current LLMs to produce low-entropy response distributions, so they struggle to display the diversity that inference-time search requires. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits the fact that rewards are often vector-valued in practice, for example per-test-case correctness in code generation, or scores from multiple user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions in which individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search metrics (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
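To make the motivation concrete, the sketch below shows how a standard scalar pipeline can discard the information a vector reward carries. It is not the VPO estimator, which the abstract does not specify. It assumes the usual GRPO-style group-relative advantage (each rollout's reward standardized against the other rollouts for the same prompt) and assumes the per-test-case reward vector is collapsed to its mean. The function names, the example group, and the use of the population standard deviation are all illustrative assumptions.

```python
from statistics import mean, pstdev


def scalarize(reward_vectors):
    """Collapse each per-test-case reward vector to one scalar.

    Assumption: the scalar is the fraction of test cases passed.
    """
    return [mean(v) for v in reward_vectors]


def group_relative_advantages(scalar_rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one group of rollouts.

    Assumption: population standard deviation, with eps to avoid
    division by zero when every rollout gets the same reward.
    """
    mu = mean(scalar_rewards)
    sigma = pstdev(scalar_rewards)
    return [(r - mu) / (sigma + eps) for r in scalar_rewards]


# Hypothetical group: 4 rollouts for one prompt, each scored on 4 test cases.
group = [
    [1, 1, 0, 0],  # passes tests 0 and 1
    [0, 0, 1, 1],  # passes tests 2 and 3 (complementary to rollout 0)
    [1, 1, 0, 0],  # duplicate of rollout 0
    [0, 0, 0, 0],  # passes nothing
]

scalars = scalarize(group)
advantages = group_relative_advantages(scalars)

for vec, adv in zip(group, advantages):
    print(vec, round(adv, 3))
```

Under these assumptions, rollouts 0 and 1 receive identical advantages even though they solve disjoint sets of test cases, and the duplicate rollout 2 is rewarded exactly as much as the complementary rollout 1. A scalar signal of this kind cannot distinguish a diverse set of solutions from a redundant one, which is the gap that a vector-aware estimator is meant to address.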