Reinforcement Learning (RL) algorithms sample multiple (n > 1) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. As a result, the sampling capacity is under-utilized, which limits exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation of the final rewards that leads to direct optimization of pass@k performance, thereby optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show that optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k = n, ours is the first to enable robust optimization of pass@k for arbitrary k ≤ n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance-reducing properties of our formulations. We also include real-world examples using the open-source LLM GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration from prioritizing joint utility over the utility of individual samples.
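The abstract does not spell out the estimators themselves. As a rough illustration of the quantity being optimized, the following is a minimal sketch, assuming the standard combinatorial estimator of pass@k (the fraction of size-k subsets of the n samples that contain at least one correct sample) and its natural continuous-reward analogue (the average, over all size-k subsets, of the best reward in the subset). The function names `pass_at_k`, `max_at_k`, and `max_at_k_bruteforce` are hypothetical, and the sketch covers only the value estimate for k <= n; it does not reproduce the gradient estimator or the reward transformation derived in the paper.

```python
from itertools import combinations
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n binary-reward samples, c of which are correct."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    # comb(n - c, k) counts the size-k subsets with no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def max_at_k(rewards: list[float], k: int) -> float:
    """Average of the best reward over all size-k subsets (continuous rewards)."""
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ordered = sorted(rewards)
    # With 0-based index i in ascending order, the i-th reward is the maximum
    # of exactly comb(i, k - 1) subsets: the other k - 1 members must come
    # from the i rewards that precede it.
    total = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return total / comb(n, k)


def max_at_k_bruteforce(rewards: list[float], k: int) -> float:
    """Reference implementation that enumerates every size-k subset."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


if __name__ == "__main__":
    # 4 samples, 1 correct, k = 2: 3 of the 6 pairs contain the correct one.
    print(pass_at_k(4, 1, 2))  # 0.5
    print(max_at_k([0.2, 0.9, 0.5, 0.5, 0.1], 3))
```

With binary rewards the two functions agree, since the best reward in a subset is 1 exactly when the subset contains a correct sample. The closed form in `max_at_k` avoids enumerating all subsets, which is why the brute-force version is kept only as a check.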