Large pre-trained code generation models, such as OpenAI Codex, can generate syntactically and functionally correct code, making programmers more productive and bringing the pursuit of artificial general intelligence closer. In this paper, we introduce CodeGeeX, a multilingual code generation model with 13 billion parameters. As of June 2022, CodeGeeX has been pre-trained on 850 billion tokens spanning 23 programming languages. Building upon HumanEval, which covers Python only, we develop the HumanEval-X benchmark for evaluating multilingual models by hand-writing solutions in C++, Java, JavaScript, and Go. Our extensive experiments suggest that CodeGeeX outperforms multilingual code models of similar scale on both the code generation and code translation tasks of HumanEval-X. In addition, we build CodeGeeX-based extensions for Visual Studio Code, JetBrains, and Cloud Studio, which generate 4.7 billion tokens per week for tens of thousands of active users. Our user study shows that CodeGeeX helps increase coding efficiency for 83.4% of its users. Finally, CodeGeeX is publicly accessible: in September 2022, we open-sourced its code, model weights (the version trained on 850B tokens), API, extensions, and HumanEval-X at https://github.com/THUDM/CodeGeeX.