Imagine a developer who can only change their last line of code. How often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering previously generated tokens. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy, owing to its better balance of diversity and quality.
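To make the top-1, top-3, and top-5 comparison concrete, the following is a minimal sketch of how top-k accuracy over ranked candidate programs could be computed. It is an illustration under stated assumptions, not the evaluation code used for CodeFusion: the helper names (`normalize`, `top_k_hit`, `top_k_accuracy`), the whitespace-normalized exact-match criterion, and the toy Bash examples are all hypothetical.

```python
from typing import Sequence


def normalize(code: str) -> str:
    """Collapse runs of whitespace so trivially different strings compare equal.

    Assumption: exact match after whitespace normalization stands in for
    whatever correctness criterion a real benchmark would use.
    """
    return " ".join(code.split())


def top_k_hit(candidates: Sequence[str], reference: str, k: int) -> bool:
    """Return True if any of the first k ranked candidates matches the reference."""
    target = normalize(reference)
    return any(normalize(c) == target for c in candidates[:k])


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
) -> float:
    """Fraction of examples whose reference appears among the top k candidates."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(top_k_hit(c, r, k) for c, r in zip(predictions, references))
    return hits / len(references)


# Toy data (hypothetical): each inner list is a ranked set of generated programs.
predictions = [
    ["ls -l", "ls -la"],              # correct at rank 1
    ["grep foo", "grep -r foo ."],    # correct at rank 2
    ["pwd", "cd"],                    # never correct
]
references = ["ls -l", "grep -r  foo .", "echo hi"]

top1 = top_k_accuracy(predictions, references, k=1)
top2 = top_k_accuracy(predictions, references, k=2)
```

Under this metric, a model that produces more diverse candidates of comparable quality can match another model at k=1 and still overtake it at larger k, because a correct program further down the ranked list counts as a hit.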