Deep learning-based code generation has transformed the way developers write programs. Existing approaches follow either the Sequence-to-Sequence paradigm, which generates target code as a sequence of tokens, or the Sequence-to-Tree paradigm, which outputs code as a sequence of tree-construction actions. Although these two paradigms are intuitively complementary, their combination has not been previously explored. By comparing the code generated under the two paradigms, we find that integrating them holds significant potential. In this paper, we propose UniGenCoder for code-related generation tasks, which consists of a shared encoder, a shared decoder with a minimal set of additional parameters that unifies the two paradigms, and a selector that dynamically chooses the optimal paradigm for each instance. During model training, we first apply multi-task learning and knowledge distillation strategies to facilitate knowledge transfer between the two paradigms, and then leverage contrastive learning to train the selector. Experimental results on text-to-code and code-to-code generation tasks demonstrate the effectiveness of our proposed model. We release our code at https://github.com/DeepLearnXMU/UniGenCoder.
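The per-instance paradigm selection described above can be sketched as follows. This is an illustrative mock, not the paper's implementation: the real selector is a learned module trained with contrastive learning, whereas here the two paradigms' confidence scores and the `ParadigmOutput` / `toy_seq2seq` / `toy_seq2tree` names are hypothetical placeholders.

```python
# Hedged sketch of dynamic per-instance paradigm selection.
# Assumption: each paradigm's decoder exposes a confidence score for the
# current instance; the selector routes to whichever scores higher.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ParadigmOutput:
    name: str          # "seq2seq" (token sequence) or "seq2tree" (action sequence)
    score: float       # hypothetical selector confidence for this instance
    output: List[str]  # generated tokens or tree-construction actions


def select_paradigm(instance: str,
                    seq2seq: Callable[[str], ParadigmOutput],
                    seq2tree: Callable[[str], ParadigmOutput]) -> ParadigmOutput:
    """Pick, per instance, whichever paradigm the selector scores higher."""
    a, b = seq2seq(instance), seq2tree(instance)
    return a if a.score >= b.score else b


# Toy stand-ins for the two decoding paradigms (not the real model).
def toy_seq2seq(src: str) -> ParadigmOutput:
    return ParadigmOutput("seq2seq", score=0.4, output=src.split())


def toy_seq2tree(src: str) -> ParadigmOutput:
    # A seq2tree decoder would emit grammar actions; "GenRule" is a placeholder.
    return ParadigmOutput("seq2tree", score=0.6, output=["GenRule"] + src.split())


chosen = select_paradigm("return max of list", toy_seq2seq, toy_seq2tree)
print(chosen.name)  # prints "seq2tree": the higher-scoring paradigm wins
```

In the actual model, both paradigms share one encoder and one decoder (with a small set of paradigm-specific parameters), so this routing adds little overhead compared with running two separate models.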