Code generation aims to generate relevant code fragments from natural language descriptions. Software development involves a large number of repetitive, low-skill coding tasks, so code generation has received considerable attention in both academia and industry as a way to assist developers. Enabling machines to understand users' requirements and write programs on their own has also long been a key concern in software engineering. Recent advances in deep learning, especially pre-trained models, have led to promising performance on the code generation task. In this paper, we systematically review current work on deep learning-based code generation and classify existing methods into three categories: methods based on code features, methods incorporating retrieval, and methods incorporating post-processing. The first category covers methods that use deep learning algorithms to generate code based on code features; the second and third categories improve the performance of methods in the first. For each category, we systematically review, summarize, and comment on existing research results. We then summarize and analyze the corpora and popular evaluation metrics used in existing code generation work. Finally, we summarize the overall literature review and discuss future research directions worthy of attention.