This paper focuses on the code generation task, which aims to generate relevant code fragments from given natural language descriptions. During software development, developers often encounter two scenarios. In the first, they are asked to write large amounts of repetitive, low-skill code to implement common functionality. In the second, they write code that depends on specific task requirements, which may require external resources such as documentation or other tools. Code generation has therefore received considerable attention from both academia and industry as a way to assist developers in coding. Indeed, enabling machines to understand users' requirements and write programs on their own has long been a key concern in software engineering. Recent advances in deep learning, especially pre-trained models, have enabled promising performance on the code generation task. In this paper, we systematically review current work on deep learning-based code generation and classify existing methods into three categories: methods based on code features, methods that incorporate retrieval, and methods that incorporate post-processing. The first category comprises methods that use deep learning algorithms to generate code based on code features; methods in the second and third categories improve the performance of those in the first. For each category, we systematically review, summarize, and comment on existing research results. We also summarize and analyze the corpora and popular evaluation metrics used in existing code generation work. Finally, we summarize the overall literature review and discuss future research directions worthy of attention.