Driven by advances in pre-trained language models, automated code generation has shown great promise in recent years. However, generated code often fails to satisfy the syntactic constraints of the target language, especially for Turducken-style code, in which declarative code snippets are embedded within imperative programs. In this study, we distill the problem of missing syntactic constraints into three challenges: (1) efficiently representing syntactic constraints, (2) effectively integrating syntactic information, and (3) designing a scalable syntax-first decoding algorithm. To address these challenges, we propose TurduckenGen, a syntax-guided multi-task learning approach. Specifically, we first explicitly append type information to the code tokens to represent the syntactic constraints. We then formalize code generation with this syntactic constraint representation as an auxiliary task, enabling the model to learn the syntactic constraints of the code. Finally, we use compiler feedback to select syntactically correct code from multiple candidates. Extensive experiments and comprehensive analysis against six state-of-the-art baselines on two Turducken-style code datasets demonstrate the effectiveness and general applicability of our approach. In addition, a human study found that the code generated by our approach is of higher quality than that of the baselines in terms of readability and semantic similarity.
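As an illustration only, two of the steps named in the abstract (appending type information to code tokens, and selecting a syntactically correct candidate with compiler feedback) can be sketched in a few lines of Python. This is not the TurduckenGen implementation. The function names `annotate_with_types`, `passes_compiler`, and `select_candidate` are hypothetical, the `token<TYPE>` format is an assumed encoding, Python's standard `tokenize` categories stand in for whatever type information the approach actually uses, and `ast.parse` stands in for the compiler. The sketch checks only the imperative host code, so a declarative snippet embedded as a string literal would need its own checker.

```python
import ast
import io
import tokenize
from typing import List, Optional

# Layout-only tokens are skipped; this filtering is a choice made for the sketch.
_SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def annotate_with_types(code: str) -> List[str]:
    """Append a lexical category to each token, e.g. 'x<NAME>' (assumed format)."""
    annotated = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in _SKIPPED:
            continue
        type_name = tokenize.tok_name[tok.exact_type]
        annotated.append(f"{tok.string}<{type_name}>")
    return annotated


def passes_compiler(code: str) -> bool:
    """Stand-in for compiler feedback: does the host code parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def select_candidate(candidates: List[str]) -> Optional[str]:
    """Return the first candidate, in the given rank order, that parses.

    Returns None when no candidate is syntactically valid; a real system
    might instead fall back to the top-ranked candidate.
    """
    for code in candidates:
        if passes_compiler(code):
            return code
    return None
```

Under these assumptions, `select_candidate` expects the candidates in model rank order (for example, beam-search order), so the highest-ranked candidate that parses is returned.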