Imagine a developer who can only change their last line of code: how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering previously generated tokens. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M to 175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy, owing to its better balance of diversity and quality.
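The top-1, top-3, and top-5 comparison can be made concrete with a minimal sketch of top-k accuracy: an example counts as solved if any of the first k ranked generations is correct. The function name `top_k_accuracy`, the whitespace-insensitive exact-match check, and the sample data are all assumptions for illustration; the abstract does not specify how correctness is judged, and the numbers below are not results from the paper.

```python
from typing import Callable, Sequence


def exact_match(pred: str, ref: str) -> bool:
    # Assumed correctness criterion: exact match after trimming whitespace.
    return pred.strip() == ref.strip()


def top_k_accuracy(
    candidates: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
    is_correct: Callable[[str, str], bool] = exact_match,
) -> float:
    """Fraction of examples whose first k ranked candidates contain a correct one."""
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    hits = sum(
        any(is_correct(pred, ref) for pred in ranked[:k])
        for ranked, ref in zip(candidates, references)
    )
    return hits / len(references)


# Hypothetical ranked generations for two prompts (illustrative data only).
candidates = [
    ["ls -la", "ls -a", "ls"],                    # correct at rank 1
    ["wc -w f.txt", "wc -l f.txt", "cat f.txt"],  # correct at rank 2
]
references = ["ls -la", "wc -l f.txt"]

for k in (1, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(candidates, references, k):.2f}")
```

A model that produces more varied candidates can gain at larger k even when its top-1 accuracy is unchanged, because a correct program lower in the ranking still counts as a hit.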