Improving Code Generation via Small Language Model-as-a-judge

Large language models (LLMs) have shown remarkable capabilities in automated code generation. Although effective for mainstream languages, they may underperform on less common or domain-specific languages, prompting companies to develop in-house code generators. Open-source models can be trained for this purpose, but only LLMs with tens of billions of parameters match the performance of commercial tools, and these demand costly training and deployment. Recent work proposed supporting code generation with small language models (SLMs) by generating multiple candidate solutions and using another SLM to select the one most likely to be correct. The most recent work in this area is by Sun et al. [29], who present RankEF, a T5 model trained to rank code solutions using both execution-based and non-execution-based information. However, Sun et al. do not assess the T5 ranker's classification accuracy, that is, how often it misjudges correct implementations as incorrect or vice versa. This leaves open questions about the reliability of LMs as code correctness judges for other tasks (e.g., automated code review). Moreover, their experiments involve relatively old models, so it is unclear to what extent such a methodology would still help companies cheaply train their own code generators with performance comparable to that of massive LLMs. We present a study addressing these limitations. We train several state-of-the-art SLMs as code correctness judges and assess their ability to discriminate between correct and wrong implementations. We show that modern SLMs outperform RankEF, even without exploiting execution-based information. When used as code rankers, they achieve higher performance gains than RankEF and perform competitively with LLMs 5-25x larger, at a fraction of the cost.
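To make the generate-then-rank idea and the notion of judge misclassification concrete, the following is a minimal sketch, not the setup used in this study or in RankEF. The names `select_best`, `judge_error_counts`, and `toy_judge`, as well as the 0.5 decision threshold, are assumptions introduced for illustration. In practice the judge would be a trained SLM that returns an estimated probability of correctness.

```python
from typing import Callable, Dict, Sequence

# A judge maps (task description, candidate code) to a correctness score in [0, 1].
# Hypothetical interface: a real judge would wrap a trained small language model.
Judge = Callable[[str, str], float]


def select_best(task: str, candidates: Sequence[str], judge: Judge) -> str:
    """Return the candidate that the judge scores as most likely correct."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda code: judge(task, code))


def judge_error_counts(
    scores: Sequence[float], labels: Sequence[bool], threshold: float = 0.5
) -> Dict[str, int]:
    """Count the two kinds of judge mistakes against ground-truth labels.

    false_negatives: correct implementations judged incorrect.
    false_positives: incorrect implementations judged correct.
    The threshold is an assumed cut-off, not a value from the paper.
    """
    false_neg = sum(1 for s, ok in zip(scores, labels) if ok and s < threshold)
    false_pos = sum(1 for s, ok in zip(scores, labels) if not ok and s >= threshold)
    return {"false_negatives": false_neg, "false_positives": false_pos}


def toy_judge(task: str, code: str) -> float:
    """Stand-in judge for demonstration only: a fixed string heuristic."""
    return 0.9 if "a + b" in code else 0.1


if __name__ == "__main__":
    candidates = ["return a - b", "return a + b"]
    print(select_best("add two numbers", candidates, toy_judge))
    print(judge_error_counts([0.9, 0.1, 0.8, 0.2], [True, True, False, False]))
```

Separating selection from error counting mirrors the distinction drawn above: a ranker can pick a good candidate from a pool while still misclassifying individual implementations, so the two are measured independently.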
