Software engineers who work with the same programming language (PL) may speak different natural languages (NLs), and vice versa, which creates substantial barriers to communication and productivity. Recent studies have demonstrated the effectiveness of generative pre-training for computer programs, yet these models are always English-centric. In this work, we take a step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling, which learns patterns from monolingual NL or PL data, and pivot-based translation language modeling, which relies on parallel data across many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of code intelligence end tasks, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage in zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.
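To make the span-corruption objective named in the abstract more concrete, the following is a minimal sketch of how one training example might be built from a monolingual token sequence. It is not the released ERNIE-Code implementation. The sentinel naming (`<extra_id_N>`), the closing sentinel, the default noise density of 0.15, the span length of 3, and the helper names `sentinel`, `corrupt`, and `sample_spans` are all assumptions borrowed from the common T5-style convention; the abstract does not specify any of them.

```python
import random


def sentinel(i):
    # Assumed T5-style sentinel naming; the abstract does not specify it.
    return f"<extra_id_{i}>"


def corrupt(tokens, spans):
    """Replace each span with a sentinel and collect the removed tokens.

    `spans` must be sorted, non-overlapping, half-open (start, end) pairs.
    Returns (inputs, targets) as token lists.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        inputs.extend(tokens[cursor:start])   # keep the uncorrupted prefix
        inputs.append(sentinel(i))            # mark where the span was
        targets.append(sentinel(i))           # same marker opens the target
        targets.extend(tokens[start:end])     # the model must recover these
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(sentinel(len(spans)))      # closing sentinel (assumed)
    return inputs, targets


def sample_spans(n_tokens, noise_density=0.15, span_len=3, rng=None):
    """Pick one fixed-length span per equal-sized segment (a simplification)."""
    if n_tokens < 2:
        raise ValueError("need at least two tokens")
    rng = rng or random.Random(0)
    n_spans = max(1, round(n_tokens * noise_density / span_len))
    n_spans = min(n_spans, n_tokens // 2)     # keep every segment >= 2 tokens
    seg = n_tokens // n_spans
    length = min(span_len, seg - 1)           # leave a gap so spans never touch
    spans = []
    for k in range(n_spans):
        lo = k * seg
        start = rng.randrange(lo, lo + seg - length)
        spans.append((start, start + length))
    return spans


# Works the same way for PL tokens (shown here) and NL tokens.
code_tokens = "def add ( a , b ) : return a + b".split()
inputs, targets = corrupt(code_tokens, [(1, 2), (8, 11)])
print(" ".join(inputs))
print(" ".join(targets))
```

Because the function only sees a flat token list, the same routine could in principle be applied to the concatenation of a parallel pair, which is one plausible reading of how a translation language modeling objective might reuse it. That reading is an assumption and is not stated in the abstract.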