Code is seldom written in a single left-to-right pass; instead, it is repeatedly edited and refined. We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via infilling). InCoder is trained to generate code files from a large corpus of permissively licensed code in which regions of code have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context. Our model is the first generative model able to directly perform zero-shot code infilling, which we evaluate on challenging tasks such as type inference, comment generation, and variable renaming. We find that the ability to condition on bidirectional context substantially improves performance on these tasks, while the model still performs comparably on standard program synthesis benchmarks to left-to-right-only models pretrained at similar scale. The InCoder models and code are publicly released: https://sites.google.com/view/incoder-code-models
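The mask-and-move transformation described above can be illustrated with a minimal sketch. It is not the released training code: it masks a single character-level span, and the sentinel strings `<MASK:0>` and `<EOM>`, the function names, and the uniform span sampling are all assumptions made for illustration.

```python
import random
from typing import Tuple

# Hypothetical sentinel strings; the tokens used by the released models may differ.
MASK = "<MASK:0>"
EOM = "<EOM>"


def mask_and_move(document: str, rng: random.Random) -> Tuple[str, str]:
    """Replace one random non-empty span with MASK and append the span at the end.

    Returns the transformed document and the span that was moved.
    """
    if not document:
        raise ValueError("document must be non-empty")
    start = rng.randrange(len(document))
    end = rng.randrange(start + 1, len(document) + 1)
    span = document[start:end]
    # Layout: left context, sentinel, right context, sentinel, span, end marker.
    transformed = document[:start] + MASK + document[end:] + MASK + span + EOM
    return transformed, span


def build_infill_prompt(left: str, right: str) -> str:
    """Prompt a left-to-right model so that its continuation is the missing span."""
    return left + MASK + right + MASK


def restore(transformed: str) -> str:
    """Invert mask_and_move, assuming the sentinels never occur in the source."""
    left, rest = transformed.split(MASK, 1)
    right, tail = rest.split(MASK, 1)
    if not tail.endswith(EOM):
        raise ValueError("missing end-of-mask marker")
    span = tail[: -len(EOM)]
    return left + span + right
```

Because the span is placed after both the left and the right context, a model that only generates left to right still sees the code on both sides of the gap before producing the infill. At inference time, under the same assumptions, the model would continue the output of `build_infill_prompt` until it emits the end marker.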