A significant amount of research is focused on developing and evaluating large language models (LLMs) for a variety of code synthesis tasks. These include synthesizing code from natural language instructions, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of LLMs on instructional code editing is understudied. These are tasks in which the model is instructed to update a block of code provided in a prompt. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, ask for a different kind of solution, or request any of many other common code edits. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting-edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is 8.8% better than the best open model at editing code. We also introduce a new, carefully curated, permissively licensed training set of code edits coupled with natural language instructions. Using this training set, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities.