Reinforcement Learning (RL) algorithms sample n > 1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. It thereby under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a remedy, we propose Pass-at-k Policy Optimization (PKPO), a transformation of the final rewards that directly optimizes pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance, unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show that optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. Whereas previous efforts are restricted to k = n, ours is the first method to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance-reducing properties of our formulations. We also include real-world examples using the open-source LLM GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration from prioritizing joint utility over the utility of individual samples.
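For concreteness, the following is a minimal sketch of the binary-reward setting. The `pass_at_k` function computes the standard combinatorial estimate of pass@k from c successes among n samples (1 - C(n-c, k)/C(n, k)), which is a known result; the `pkpo_style_rewards` function is only a hypothetical leave-one-out illustration of how per-sample credit could be derived from such a set-level estimate, not the paper's actual PKPO transformation.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased estimate of pass@k given c successes among n samples:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:  # every size-k subset is guaranteed to contain a success
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def pkpo_style_rewards(binary_rewards: list[int], k: int) -> list[float]:
    """Hypothetical joint transform (illustration only): credit each sample
    with the leave-one-out change in the pass@k estimate, so samples are
    rewarded for their marginal contribution to the set rather than in
    isolation. This is not the paper's exact low-variance estimator."""
    n = len(binary_rewards)
    c = sum(binary_rewards)
    full = pass_at_k(n, c, k)
    out = []
    for r in binary_rewards:
        reduced = pass_at_k(n - 1, c - r, k)  # estimate without this sample
        out.append(full - reduced)            # marginal contribution to pass@k
    return out


if __name__ == "__main__":
    rewards = [1, 0, 0, 1, 0, 0, 0, 0]   # 2 successes among n = 8 samples
    print(pass_at_k(8, 2, 4))            # pass@4 estimate ~= 0.786
    print(pkpo_style_rewards(rewards, k=4))
```

In this sketch, successful samples receive positive credit and failures receive small negative credit, reflecting the abstract's point that the transformed rewards value the joint utility of the sampled set rather than each sample in isolation.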