Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. Many benchmarks (e.g., HumanEval and ClassEval) have been proposed to evaluate these capabilities from various aspects. Code reasoning is one of the most essential abilities of code LLMs, but existing benchmarks for code reasoning are insufficient: they typically focus on predicting a program's input or output, ignoring both the intermediate behavior during program execution and the logical consistency of the reasoning (e.g., a model should not predict the correct output when its prediction of the execution path is wrong). To address these problems, we propose REval, a framework for evaluating the code reasoning abilities and consistency of code LLMs with program execution. We adapt existing code benchmarks into new benchmarks within our framework and conduct a large-scale empirical study. Most LLMs show unsatisfactory performance on both Runtime Behavior Reasoning (an average accuracy of 44.4%) and Incremental Consistency Evaluation (an average IC score of 10.3). These results reflect the urgent need for the community to strengthen the code reasoning capability of code LLMs.
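The logical-consistency requirement in the abstract can be illustrated with a minimal sketch. This is a hypothetical toy scoring function, not the paper's actual IC metric: it treats a prediction as inconsistent whenever the model gets the output right while its execution-path prediction is wrong, which is exactly the failure mode the parenthetical example describes.

```python
def consistency_rate(predictions):
    """Toy consistency measure (illustrative only, not REval's IC score).

    predictions: list of (path_correct, output_correct) boolean pairs,
    one pair per evaluated program.
    """
    def is_consistent(path_correct, output_correct):
        # Inconsistent case: correct output despite a wrong
        # prediction of the execution path it depends on.
        return not (output_correct and not path_correct)

    consistent = sum(
        is_consistent(p, o) for p, o in predictions
    )
    return consistent / len(predictions)


# Example: the second prediction is "lucky" (right output, wrong path),
# so only 3 of 4 predictions are logically consistent.
preds = [(True, True), (False, True), (True, False), (True, True)]
print(consistency_rate(preds))  # 0.75
```

A real evaluation would additionally require step-by-step (incremental) agreement across the whole execution trace, which is what the framework's Incremental Consistency Evaluation targets.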