Large language models for code (code LLMs) have shown strong code understanding and generation capabilities. Many benchmarks, such as HumanEval and ClassEval, have been proposed to evaluate these capabilities from various angles. Code reasoning is one of the most essential abilities of code LLMs, yet existing code reasoning benchmarks are insufficient. They typically focus on predicting the input and output of a program and ignore both the intermediate behavior during program execution and the logical consistency of the reasoning (e.g., a model should not produce the correct output if its predicted execution path is wrong). To address these problems, we propose REval, a framework for evaluating the code reasoning abilities and consistency of code LLMs with program execution. We adapt existing code benchmarks into new benchmarks within our framework. In a large-scale empirical study, most LLMs perform unsatisfactorily on both Runtime Behavior Reasoning (average accuracy of 44.4%) and Incremental Consistency Evaluation (average IC score of 10.3). These results reflect an urgent need for the community to strengthen the code reasoning capability of code LLMs. Our code, data, and the REval leaderboard are available at https://r-eval.github.io.