There has been a recent surge of interest in automating software engineering tasks using deep learning. This paper addresses the problem of code generation, where the goal is to generate target code given source code in a different language or a natural language description. Most state-of-the-art deep learning models for code generation use training strategies primarily designed for natural language. However, understanding and generating code requires a more rigorous comprehension of the code syntax and semantics. With this motivation, we develop an encoder-decoder Transformer model where both the encoder and decoder are explicitly trained to recognize the syntax and data flow in the source and target codes, respectively. We not only make the encoder structure-aware by leveraging the source code's syntax tree and data flow graph, but we also support the decoder in preserving the syntax and data flow of the target code by introducing two novel auxiliary tasks: AST (Abstract Syntax Tree) paths prediction and data flow prediction. To the best of our knowledge, this is the first work to introduce a structure-aware Transformer decoder that models both syntax and data flow to enhance the quality of generated code. The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks in the CodeXGLUE benchmark, and improves over baselines of similar size on the APPS code generation benchmark. Our code is publicly available at https://github.com/reddy-lab-code-research/StructCoder/.
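To make the two structural signals concrete, the following minimal sketch extracts root-to-leaf AST paths and simple data flow edges from a small Python snippet. It is an illustration only and not the StructCoder implementation: it relies on Python's standard `ast` module, the helper names `root_to_leaf_paths` and `data_flow_edges` are hypothetical, and the data flow analysis assumes straight-line code with top-level statements (no branches, loops, or function scopes).

```python
from __future__ import annotations

import ast


def root_to_leaf_paths(source: str) -> list[list[str]]:
    """Return the sequence of node types from the root to each AST leaf."""
    paths: list[list[str]] = []

    def visit(node: ast.AST, prefix: list[str]) -> None:
        prefix = prefix + [type(node).__name__]
        # Skip Load/Store context markers so that variable names are leaves.
        children = [
            child
            for child in ast.iter_child_nodes(node)
            if not isinstance(child, ast.expr_context)
        ]
        if not children:
            paths.append(prefix)
        for child in children:
            visit(child, prefix)

    visit(ast.parse(source), [])
    return paths


def data_flow_edges(source: str) -> list[tuple[str, int, int]]:
    """Return (variable, definition line, use line) triples.

    Simplifying assumption: straight-line code, so each read of a variable
    is linked to its most recent earlier assignment.
    """
    last_def: dict[str, int] = {}
    edges: list[tuple[str, int, int]] = []
    for stmt in ast.parse(source).body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Reads are resolved before writes, so `x = x + 1` links to the old x.
        for name in names:
            if isinstance(name.ctx, ast.Load) and name.id in last_def:
                edges.append((name.id, last_def[name.id], name.lineno))
        for name in names:
            if isinstance(name.ctx, ast.Store):
                last_def[name.id] = name.lineno
    return edges


if __name__ == "__main__":
    snippet = "x = 1\ny = x + 2\nz = x * y\n"
    for path in root_to_leaf_paths(snippet):
        print(" -> ".join(path))
    for variable, def_line, use_line in data_flow_edges(snippet):
        print(f"{variable}: defined on line {def_line}, used on line {use_line}")
```

In this simplified view, an auxiliary AST paths prediction task would ask the decoder to predict the node types along the path to each leaf, and a data flow prediction task would ask it to predict which pairs of positions are connected by an edge of the kind listed by `data_flow_edges`.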