Code generation is a core application of large language models (LLMs), yet LLMs still frequently fail on complex programming tasks. Given the success of test-time scaling in mathematical reasoning, approaches such as Process Reward Model (PRM)-based Best-of-N selection offer a promising way to improve performance. However, existing PRMs remain ineffective for code generation due to the lack of meaningful step decomposition in code and the noise in Monte Carlo-estimated partial-solution correctness scores (rewards). To address these challenges, we propose FunPRM. FunPRM prompts LLMs to encourage modular code generation organized into functions, with functions treated as PRM reasoning steps. Furthermore, FunPRM introduces a novel meta-learning-based reward correction mechanism that leverages clean final-solution rewards, obtained via a unit-test-based evaluation system, to purify noisy partial-solution rewards. Experiments on LiveCodeBench and BigCodeBench demonstrate that FunPRM consistently outperforms existing test-time scaling methods across five base LLMs, notably achieving state-of-the-art performance on LiveCodeBench when combined with O4-mini. Moreover, FunPRM produces code that is more readable and reusable for developers.