This paper proposes CES, a task to evaluate the abilities of LLMs in simulating program execution and using that reasoning in programming tasks. Besides measuring the correctness of variable predictions during execution simulation, CES introduces the notion of coherence to determine whether the simulation complies with commonsense execution logic, even if the predicted values along the simulation are incorrect. This enables CES to rule out suspiciously correct output predictions due to reasoning shortcuts, hallucinations, or potential data leakage. CES also introduces a novel metric to measure reasoning consistency across tests with the same or different prime path coverage on a spectrum of strong, weak, and random. Evaluating 16 LLMs (including three reasoning LLMs) using CES indicates 81.42% coherent execution simulation on HumanEval, 46.92% and 53.08% of which result in correct and incorrect output predictions, respectively. Frontier LLMs such as GPT-4 and DeepSeek-R1 have the most incoherent execution reasoning, mostly due to natural language shortcuts. Despite relatively coherent execution simulation, LLMs' reasoning performance across different tests is inconsistent, mostly random (48.87%) or weak (45.37%), potentially explaining their weakness in programming tasks that require path-sensitive program analysis to succeed. We also compare CES with bug prediction/localization/repair, tasks that intuitively require control- and data-flow awareness. We observe that LLMs barely incorporate execution reasoning into their analysis for bug-related tasks, and their success is primarily due to inherent abilities in pattern matching or natural language shortcuts, if not data leakage. Without such reasoning, the generalizability of LLMs to unseen bugs or patterns in different contexts is threatened. CES can be used to systematically vet the suspicious success of LLMs in these tasks.