CodeMind: Evaluating Large Language Models for Code Reasoning

Large Language Models (LLMs) are widely used to automate programming tasks. Their capabilities are typically evaluated by checking the quality of the generated code through tests or proofs. How well they can reason about code is a separate and critical question, one that reveals important insights about their true capabilities. This paper introduces CodeMind, a framework designed to gauge the code reasoning abilities of LLMs through three explicit and implicit code reasoning tasks: Independent Execution Reasoning (IER), Specification Reasoning (SR), and Dynamic Semantics Reasoning (DSR). IER evaluates an LLM's ability to simulate the execution of a piece of code on given inputs and predict the output. SR assesses an LLM's ability to incorporate the simulation of test data in the specification into code generation. DSR evaluates an LLM's ability to understand overall code semantics given only a specific input/output. Our extensive evaluation of ten LLMs across four widely used benchmarks using CodeMind shows that LLMs, depending on their size and training strategy, can reason about some dynamic aspects of code. However, their performance drops for code with higher complexity, non-trivial logical and arithmetic operators, non-primitive types, and API calls. We show that these reasoning tasks evaluate LLMs differently, and that a comprehensive evaluation of code reasoning requires all of them. Finally, we show that the performance of LLMs in bug repair is not correlated with any of the code reasoning tasks, and that, with the exception of advanced frontier models, LLMs do not incorporate code reasoning when performing bug repair.
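
To make the IER task concrete, the following is a minimal sketch of how a single output prediction could be scored: the code is executed on the given input to obtain the ground-truth output, and the model's predicted output is compared against it. This is an illustration under stated assumptions, not CodeMind's actual implementation. The helper names (`run_program`, `ier_correct`, `ier_rate`), the sample program, and the exact-match comparison on `repr` are all hypothetical, and the abstract does not specify the prompts or the metric used.

```python
from typing import Any, Dict, List, Tuple


def run_program(source: str, entry_point: str, args: Tuple[Any, ...]) -> Any:
    """Execute `source` and call `entry_point` on `args` to get the real output."""
    namespace: Dict[str, Any] = {}
    # Assumption: benchmark code is trusted. Untrusted code needs a sandbox.
    exec(source, namespace)
    return namespace[entry_point](*args)


def ier_correct(predicted: str, source: str, entry_point: str,
                args: Tuple[Any, ...]) -> bool:
    """Treat a prediction as correct if it exactly matches the real output.

    Assumption: outputs are compared via their `repr`; a real harness may
    normalize or parse the model's answer more leniently.
    """
    expected = run_program(source, entry_point, args)
    return predicted.strip() == repr(expected)


def ier_rate(results: List[bool]) -> float:
    """Fraction of programs whose output was predicted correctly."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical benchmark program, not taken from the paper.
SAMPLE = '''
def count_vowels(s):
    return sum(1 for ch in s.lower() if ch in "aeiou")
'''

if __name__ == "__main__":
    # Hypothetical model answers for two inputs.
    cases = [(("CodeMind",), "3"), (("LLM",), "2")]
    results = [ier_correct(pred, SAMPLE, "count_vowels", args)
               for args, pred in cases]
    print(results, ier_rate(results))
```

In this sketch the first hypothetical prediction matches the executed result and the second does not, so half of the cases are scored as correct. SR and DSR would need different harnesses (generating code from a specification, or predicting an output for unseen inputs from one input/output example), which are not sketched here.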
