Code Large Language Models (Code LLMs) have excelled at tasks like code completion but often miss deeper semantics such as execution effects and dynamic states. This paper aims to bridge the gap between Code LLMs' reliance on static text data and the need for semantic understanding in complex tasks like debugging and program repair. We introduce a novel strategy, monologue reasoning, to train Code LLMs to reason about comprehensive semantics, encompassing high-level functional descriptions, local execution effects of individual statements, and overall input/output behavior, thereby linking static code text with dynamic execution states. We begin by collecting PyX, a clean Python corpus of fully executable code samples paired with functional descriptions and test cases. We propose training Code LLMs not only to write code but also to understand code semantics by reasoning about key properties, constraints, and execution behaviors in natural language, mimicking human verbal debugging (i.e., rubber-duck debugging). This approach led to the development of SemCoder, a Code LLM with only 6.7B parameters that performs competitively with GPT-3.5-turbo on code generation and execution reasoning tasks. SemCoder achieves 79.3% on HumanEval (GPT-3.5-turbo: 76.8%), 63.6% on CRUXEval-I (GPT-3.5-turbo: 50.3%), and 63.9% on CRUXEval-O (GPT-3.5-turbo: 59.0%). We also compare SemCoder's monologue-style execution reasoning with concrete scratchpad reasoning, showing that our approach integrates semantics across multiple dimensions more smoothly. Finally, we demonstrate the potential of applying the learned semantics to improve Code LLMs' debugging and self-refinement capabilities. Our data, code, and models are available at: https://github.com/ARiSE-Lab/SemCoder.