A significant amount of research is focused on developing and evaluating large language models (LLMs) for a variety of code synthesis tasks. These include synthesizing code from natural language instructions, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is instructed to update a block of code provided in a prompt. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, ask for a different kind of solution, or request any of many other common code edits. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting-edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is 8.8% better than the best open model at editing code. We also introduce a new, carefully curated, permissively licensed training set of code edits coupled with natural language instructions. Using this training set, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities.