Reinforcement Learning (RL) algorithms sample n > 1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. It thereby under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a remedy, we propose Pass-at-k Policy Optimization (PKPO), a transformation of the final rewards that directly optimizes pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance, unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show that optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. Whereas previous efforts are restricted to k = n, ours is the first method to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance-reducing properties of our formulations. We also include real-world examples using the open-source LLM GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration from prioritizing joint utility over the utility of individual samples.
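For concreteness, the following is a minimal sketch of the binary-reward setting. The `pass_at_k` function computes the standard combinatorial estimate of pass@k from c successes among n samples (1 - C(n-c, k)/C(n, k)), which is a known result; the `pkpo_style_rewards` function is only a hypothetical leave-one-out illustration of how per-sample credit could be derived from such a set-level estimate, not the paper's actual PKPO transformation.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased estimate of pass@k given c successes among n samples:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:  # every size-k subset is guaranteed to contain a success
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def pkpo_style_rewards(binary_rewards: list[int], k: int) -> list[float]:
    """Hypothetical joint transform (illustration only): credit each sample
    with the leave-one-out change in the pass@k estimate, so samples are
    rewarded for their marginal contribution to the set rather than in
    isolation. This is not the paper's exact low-variance estimator."""
    n = len(binary_rewards)
    c = sum(binary_rewards)
    full = pass_at_k(n, c, k)
    out = []
    for r in binary_rewards:
        reduced = pass_at_k(n - 1, c - r, k)  # estimate without this sample
        out.append(full - reduced)            # marginal contribution to pass@k
    return out


if __name__ == "__main__":
    rewards = [1, 0, 0, 1, 0, 0, 0, 0]   # 2 successes among n = 8 samples
    print(pass_at_k(8, 2, 4))            # pass@4 estimate ~= 0.786
    print(pkpo_style_rewards(rewards, k=4))
```

In this sketch, successful samples receive positive credit and failures receive small negative credit, reflecting the abstract's point that the transformed rewards value the joint utility of the sampled set rather than each sample in isolation.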