Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows promise in guiding reasoning steps, it requires expensive training data and suffers from unreliable evaluation. We propose Outcome-Refining Process Supervision, a novel paradigm that treats outcome refinement itself as the process to be supervised. Our framework leverages concrete execution signals to ground the supervision of reasoning steps, while using tree-structured exploration to maintain multiple solution trajectories simultaneously. Experiments demonstrate that our approach enables even smaller models to achieve high success rates and strong performance on competitive programming tasks, and yields more reliable verification than traditional reward models without requiring the training of process reward models (PRMs). Our approach achieves significant improvements across 5 models and 3 datasets: an average 26.9% increase in correctness and 42.2% in efficiency. The results suggest that providing a structured reasoning space with concrete verification signals is crucial for solving complex programming tasks. We open-source all our code and data at: https://github.com/zhuohaoyu/ORPS
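The core loop described above, tree-structured exploration of candidate solutions ranked by concrete execution signals rather than a learned reward model, can be sketched as follows. This is a minimal illustrative sketch, not the authors' ORPS implementation; the `Candidate` class, the `expand` callback (standing in for LLM-proposed revisions), and the test-passing score are all hypothetical names introduced here.

```python
# Hypothetical sketch of outcome-refining tree search (not the authors' ORPS code):
# keep a beam of candidate programs, expand each with revisions, and rank
# candidates by a concrete execution signal (fraction of unit tests passed).

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Candidate:
    code: str       # candidate program text
    reasoning: str  # reasoning trace that produced it

def execution_score(code: str, tests: List[Callable[[Dict], bool]]) -> float:
    """Concrete execution signal: run the program, return fraction of tests passed."""
    ns: Dict = {}
    try:
        exec(code, ns)  # execute the candidate program in a fresh namespace
    except Exception:
        return 0.0      # non-running code gets the lowest possible score
    passed = 0
    for test in tests:
        try:
            passed += bool(test(ns))
        except Exception:
            pass        # a crashing test counts as a failure
    return passed / len(tests) if tests else 0.0

def refine_tree_search(
    seeds: List[Candidate],
    expand: Callable[[Candidate], List[Candidate]],  # e.g. LLM proposes revisions
    tests: List[Callable[[Dict], bool]],
    beam_width: int = 3,
    depth: int = 2,
) -> Candidate:
    """Maintain multiple solution trajectories; supervise refinement by execution."""
    beam = seeds
    for _ in range(depth):
        children = [child for cand in beam for child in expand(cand)]
        pool = beam + children
        # rank by execution signal instead of a trained process reward model
        pool.sort(key=lambda c: execution_score(c.code, tests), reverse=True)
        beam = pool[:beam_width]
    return beam[0]
```

In this sketch the execution score replaces a learned PRM as the step-level judge: refinements that make more tests pass survive in the beam, so the supervision signal is grounded in observable program behavior.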