Code understanding and generation have quickly become some of the most popular applications of language models (LMs). Nonetheless, research on multilingual aspects of Code-LMs (i.e., LMs for code generation), such as cross-lingual transfer between programming languages, language-specific data augmentation, and post-hoc LM adaptation, as well as the exploitation of data sources other than the original textual content, has been much sparser than for their natural language counterparts. In particular, most mainstream Code-LMs have been pre-trained on source code files alone. In this work, we investigate the prospect of leveraging readily available compiler intermediate representations (IR), which are shared across programming languages, to improve the multilingual capabilities of Code-LMs and facilitate cross-lingual transfer. To this end, we first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files coupled with their respective intermediate representations. Next, starting from various base Code-LMs (ranging in size from 1.1B to 7.3B parameters), we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to (1) learn the IR language and (2) align the IR constructs with the respective constructs of various programming languages. Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics, including prompt robustness, multilingual code completion, code understanding, and instruction following.
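To make the idea of a source/IR parallel sample concrete, the following is a minimal sketch of how one such pair might be joined into a single sequence for causal language modelling. It is an illustration under stated assumptions, not the authors' pipeline: the separator token `SEP`, the `Pair` type, the `make_training_text` helper, the 50/50 ordering, and the use of LLVM-style IR in the example are all hypothetical choices that the abstract does not specify.

```python
import random
from dataclasses import dataclass

# Hypothetical separator token; the abstract does not name one.
SEP = "<|ir_sep|>"


@dataclass
class Pair:
    """One parallel sample: a self-contained source file and its IR."""
    language: str
    source: str
    ir: str


def make_training_text(pair: Pair, rng: random.Random) -> str:
    """Join source and IR into one sequence for next-token prediction.

    The order is randomised (an assumption) so that a model trained on the
    result sees both source-then-IR and IR-then-source contexts.
    """
    parts = [pair.source.strip(), pair.ir.strip()]
    if rng.random() < 0.5:
        parts.reverse()
    return f"{parts[0]}\n{SEP}\n{parts[1]}"


# Example pair; the IR is written in LLVM syntax purely for illustration.
pair = Pair(
    language="c",
    source="int add(int a, int b) { return a + b; }",
    ir=(
        "define i32 @add(i32 %a, i32 %b) {\n"
        "  %s = add nsw i32 %a, %b\n"
        "  ret i32 %s\n"
        "}"
    ),
)

text = make_training_text(pair, random.Random(0))
print(text)
```

Because the same IR vocabulary appears next to source code in every language, sequences built this way give a model a shared anchor across languages, which is the intuition behind objectives (1) and (2) above.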