Generating high-quality code remains a challenge for Large Language Models (LLMs). Reward models are a necessary intermediate step in the evolution of reasoning models for this task; they judge final outcomes or intermediate steps. Decoder-only transformer models can be turned into reward models by adding a regression layer and applying supervised fine-tuning. While reflection capabilities are known to generally increase with model size, we investigate whether state-of-the-art small language models such as the Phi-4 family can be turned into usable reward models that blend process rewards and outcome rewards. To this end, we construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. We then train a value-head model to estimate the success probability of intermediate outputs. Our evaluation shows that small LLMs can serve as effective reward models or code evaluation critics, successfully identifying correct solutions among multiple candidates. Using this critic, we improve the ability to select the most accurate code among multiple generations by over 20%.
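A minimal sketch of the value-head construction described above: a scalar regression layer placed on top of a decoder-only backbone, with a sigmoid over its output giving an estimated success probability for a (problem, candidate code) pair. The checkpoint name, pooling choice, and prompt format below are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer


class ValueHeadRewardModel(nn.Module):
    """Decoder-only LM with a scalar regression ("value") head.

    The head maps the hidden state of the last non-padding token to a
    logit; a sigmoid turns that logit into a success probability.
    """

    def __init__(self, base_model_name: str):
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(base_model_name)
        hidden_size = self.backbone.config.hidden_size
        self.value_head = nn.Linear(hidden_size, 1)  # regression layer

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        last_hidden = outputs.hidden_states[-1]           # (batch, seq, hidden)
        # Pool at the last non-padding position of each sequence.
        last_token_idx = attention_mask.sum(dim=1) - 1    # (batch,)
        batch_idx = torch.arange(last_hidden.size(0), device=last_hidden.device)
        pooled = last_hidden[batch_idx, last_token_idx]   # (batch, hidden)
        return self.value_head(pooled).squeeze(-1)        # (batch,) logits


# Illustrative usage with a placeholder checkpoint; training would fit the
# logits to pass/fail correctness labels with a binary cross-entropy loss.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
model = ValueHeadRewardModel("microsoft/phi-4")
batch = tokenizer(["# Problem: ...\ndef solve():\n    ..."], return_tensors="pt")
with torch.no_grad():
    success_prob = torch.sigmoid(model(batch["input_ids"], batch["attention_mask"]))
```

At inference time, such a critic can score each of several sampled completions and keep the highest-scoring one, which is the selection setting evaluated in the abstract.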