A significant amount of research focuses on developing and evaluating large language models (LLMs) for a variety of code synthesis tasks, including synthesizing code from natural language, tests from code, and explanations of code. In contrast, instructional code editing with LLMs is understudied. In these tasks, the model is given a block of code and an instruction to modify it. The instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting-edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo outperforms the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that open Code LLMs can be fine-tuned to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.
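To make the task format concrete, the following is a minimal sketch of what one instructional code editing task might look like: a block of code paired with a natural language instruction, formatted into a prompt for a model. The `EditTask` record, the `build_prompt` helper, and the prompt layout are illustrative assumptions, not the benchmark's actual schema or prompt format.

```python
from dataclasses import dataclass


@dataclass
class EditTask:
    """Hypothetical record for one instructional code editing task."""
    before: str       # the block of code the model is given
    instruction: str  # natural language description of the requested change


def build_prompt(task: EditTask) -> str:
    """Combine code and instruction into one prompt (assumed layout)."""
    return (
        "## Code Before:\n"
        f"{task.before}\n"
        "## Instruction:\n"
        f"{task.instruction}\n"
        "## Code After:\n"
    )


# Example of the "describe a bug and ask for a fix" kind of instruction.
task = EditTask(
    before="def add(a, b):\n    return a - b\n",
    instruction="Fix the bug: add should return the sum of a and b.",
)
prompt = build_prompt(task)
```

The model's completion after the final header would then be taken as the edited code; how that output is scored is outside the scope of this sketch.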