Deep code generation is a topic in deep learning for software engineering (DL4SE) that uses neural models to generate code for intended functions. Because end-to-end neural methods lack domain knowledge and awareness of software hierarchy, they tend to perform poorly on project-level tasks. To systematically explore potential improvements to code generation, we let it participate in the whole top-down development process, from \emph{expressibles} to \emph{executables}, which is feasible within limited scopes. Along the way, code generation benefits from massive samples, features, and knowledge. As a foundation, we suggest building a taxonomy of code data, namely a code taxonomy, that leverages the categorization of code information. We further introduce a three-layer semantic pyramid (SP) to associate text data with code data. The SP identifies information at different abstraction levels, and thereby introduces domain knowledge about development and reveals the hierarchy of software. Building on the SP, we propose a semantic pyramid framework (SPF) as our approach, focusing on software of high modularity and low complexity. SPF divides the code generation process into stages and reserves spots for potential interactions. Finally, we conceive preliminary applications in software development to validate the neuro-symbolic framework.