Large pre-trained language models have recently been extended to programming language tasks with great success, often through further pre-training of a strictly natural language model on sequences that typically contain both natural and (linearised) programming language. Such approaches effectively map both modalities of the sequence into the same embedding space. However, programming language keywords (e.g. "while") often have very strictly defined semantics. As such, transfer learning from their natural language usage may not necessarily benefit their use in code, and vice versa. Assuming an already pre-trained language model, in this work we investigate how sequence tokens can be adapted and represented differently depending on the modality they belong to, to the ultimate benefit of the downstream task. We experiment with separating the embedding spaces of the two modalities during further model pre-training with modality-relative training objectives. We focus on text-to-code generation and observe consistent improvements across two backbone models and two test sets, measuring pass@$k$ and a novel incremental variation.
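For readers unfamiliar with the metric, pass@$k$ is commonly computed with the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ programs are sampled per problem and $c$ of them pass the unit tests. The sketch below illustrates that standard estimator only; whether this work uses exactly this form is an assumption, the function names `pass_at_k` and `mean_pass_at_k` are hypothetical, and the incremental variation is not reproduced because it is not defined here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass.

    Assumption: this is the commonly used unbiased estimator,
    1 - C(n - c, k) / C(n, k), not necessarily the exact variant
    used in this work.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing samples, every size-k subset of the
    # n samples contains at least one passing program.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For instance, `pass_at_k(5, 1, 2)` asks how likely a random pair drawn from five samples, one of which is correct, contains that correct sample.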