Large Language Models (LLMs) have shown promising performance in code generation. However, how to reliably evaluate code generated by LLMs remains an unresolved problem. This paper presents CodeJudge, a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases. We investigate different ways to guide the LLM in performing "slow thinking" to arrive at an in-depth and reliable evaluation. We experimented with four LLMs as evaluators on four code generation datasets and five programming languages. The results show that CodeJudge significantly outperformed existing methods in most settings. Furthermore, compared with a state-of-the-art GPT-3.5-based code evaluation method, CodeJudge achieved better results even when using a much smaller model, Llama-3-8B-Instruct. Our code and datasets are available at https://github.com/VichyTong/CodeJudge.