A significant amount of research is focused on developing and evaluating large language models (LLMs) for a variety of code synthesis tasks. These include synthesizing code from natural language instructions, synthesizing tests from code, and synthesizing explanations of code. In contrast, instructional code editing with LLMs is understudied. These are tasks in which the model is instructed to update a block of code provided in a prompt. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, ask for a different kind of solution, or request any of many other common code edits. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting-edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is 8.8% better than the best open model at editing code. We also introduce a new, carefully curated, permissively licensed training set of code edits coupled with natural language instructions. Using this training set, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities.