Inference-time computation provides an important axis for scaling language model performance, but naively scaling compute through techniques like Best-of-$N$ sampling can cause performance to degrade due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we focus on inference-time alignment, which we formalize as the problem of improving a pre-trained policy's responses for a prompt of interest, given access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality and (ii) compute, and provide new results that highlight the importance of the pre-trained policy's coverage over high-quality responses for both performance and compute scaling: 1. We show that Best-of-$N$ alignment with an ideal choice of $N$ can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when $N$ is large and fails to achieve tight guarantees under more realistic coverage conditions. 2. We introduce $\texttt{InferenceTimePessimism}$, a new algorithm that mitigates reward hacking through deliberate use of inference-time compute, implementing the principle of pessimism in the face of uncertainty via rejection sampling; we prove that its performance is optimal and does not degrade with $N$, meaning it is scaling-monotonic. We complement our theoretical results with an experimental evaluation that demonstrates the benefits of $\texttt{InferenceTimePessimism}$ across a variety of tasks and models.
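To make the two sampling strategies concrete, the sketch below contrasts plain Best-of-$N$ selection with a rejection-sampling scheme in the spirit of $\texttt{InferenceTimePessimism}$. This is a minimal illustration rather than the paper's exact procedure: the function names (`sample_response`, `reward_model`), the regularization parameter `lam`, the reward upper bound `r_max`, and the acceptance rule targeting a KL-regularized tilt of the pre-trained policy, $\pi(y) \propto \pi_{\mathrm{ref}}(y)\exp(r(y)/\lambda)$, are assumptions made for exposition.

```python
import math
import random

def best_of_n(sample_response, reward_model, n):
    # Best-of-N alignment: draw n candidate responses from the
    # pre-trained policy and keep the one the (imperfect) reward
    # model scores highest. As n grows, the argmax increasingly
    # chases errors in the reward model (reward hacking).
    candidates = [sample_response() for _ in range(n)]
    return max(candidates, key=reward_model)

def pessimistic_rejection_sampler(sample_response, reward_model,
                                  lam, r_max, budget):
    # Hypothetical sketch of pessimism via rejection sampling:
    # accept a draw y with probability exp((r(y) - r_max) / lam),
    # which targets the tilted policy
    #     pi(y) ∝ pi_ref(y) * exp(r(y) / lam).
    # The regularizer lam tempers trust in the imperfect reward
    # model, so a larger sampling budget sharpens the sampler
    # instead of amplifying reward hacking.
    fallback = None
    for _ in range(budget):
        y = sample_response()
        fallback = y
        if random.random() < math.exp((reward_model(y) - r_max) / lam):
            return y
    return fallback  # budget exhausted; return the last draw
```

In this sketch, a small `lam` approaches Best-of-$N$-style greediness while a large `lam` stays close to the pre-trained policy; how the regularization should be set is left to the caller here, whereas the paper's algorithm prescribes it.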