Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions

Federico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Jacob Ginesin, Edward Berman, George Chakhnashvili, Anton Lozhkov, Carolyn Jane Anderson, Arjun Guha

A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting-edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.
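To make the task format concrete, here is a minimal sketch of one instructional code editing task: a block of code, an instruction to modify it, and a check on the edited result. The `EditTask` type, the `passes` helper, and the idea of judging an edit by running assertions are all assumptions made for illustration. The abstract does not describe the benchmark's actual data format or scoring method.

```python
from dataclasses import dataclass


@dataclass
class EditTask:
    """Hypothetical container for one editing task (not the benchmark's schema)."""
    before: str       # code shown to the model
    instruction: str  # natural language request to modify the code
    tests: str        # assertions the edited code is expected to satisfy


def passes(task: EditTask, candidate: str) -> bool:
    """Run the candidate code, then the task's assertions, in a fresh namespace.

    Assumption: an edit counts as correct when the assertions pass.
    exec() is unsafe on untrusted model output; a real harness would sandbox it.
    """
    namespace: dict = {}
    try:
        exec(candidate, namespace)
        exec(task.tests, namespace)
    except Exception:
        return False
    return True


# A "describe a bug and ask for a fix" style task.
task = EditTask(
    before="def mean(xs):\n    return sum(xs) / len(xs)\n",
    instruction="Return 0.0 for an empty list instead of raising an error.",
    tests="assert mean([]) == 0.0\nassert mean([1, 2, 3]) == 2.0\n",
)

# What a correct model response might look like.
edited = (
    "def mean(xs):\n"
    "    if not xs:\n"
    "        return 0.0\n"
    "    return sum(xs) / len(xs)\n"
)

print(passes(task, task.before))  # unedited code
print(passes(task, edited))       # edited code
```

The unedited code fails the check because `mean([])` divides by zero, while the edited version satisfies both assertions. The same shape would cover the other instruction types mentioned above, such as adding or removing a feature.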

