Exploring Multi-Lingual Bias of Large Code Models in Code Generation

Code generation aims to synthesize code that fulfills functional requirements from natural language (NL) specifications, which can greatly improve development efficiency. In the era of large language models (LLMs), large code models (LCMs) have recently been proposed to generate source code, and they can produce highly feasible solutions for programming problems described in natural language. Despite this effectiveness, we observe a noticeable multi-lingual bias in the generation performance of LCMs. Specifically, LCMs are proficient at generating solutions when given instructions in English, yet may falter when faced with semantically equivalent instructions in other NLs such as Chinese. Moreover, the code generation ability of LCMs varies across programming languages (PLs), such as Python and C++. These observations indicate a multi-lingual bias in the generative capabilities of LCMs that has remained unexplored. In this paper, we investigate this bias. First, we construct the first multi-lingual evaluation benchmark, X-HumanEval-X, which enables us to systematically evaluate the extent of multi-lingual bias in current LCMs. In large-scale experiments on nine popular LCMs, we observe a pronounced multi-lingual bias in code generation, covering both multi-NL and multi-PL bias. Specifically, when Chinese instructions are used, the code generation capabilities of LCMs decrease by at least 13% in terms of the Pass@1 metric. Furthermore, LCMs perform differently across programming languages, e.g., the performance gap between Python and C++ reaches as high as 20.9%. ...
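
The abstract reports its results in terms of Pass@1 but does not say how the metric is computed. As a minimal sketch, assuming the commonly used unbiased pass@k estimator (n samples per problem, c of which pass the unit tests), the metric and a gap between two instruction languages could be computed as below. The function names and the sample counts are hypothetical and are not taken from the paper; the abstract also does not state whether the reported drop is absolute or relative, so the sketch returns both.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem as 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that pass all tests, k: budget.
    This estimator is an assumption; the abstract does not define Pass@1.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


def performance_gap(baseline: float, other: float) -> tuple[float, float]:
    """Return (absolute gap, gap relative to the baseline score)."""
    absolute = baseline - other
    return absolute, absolute / baseline


# Hypothetical (n, c) counts for two problems, NOT results from the paper.
english_results = [(10, 8), (10, 6)]
chinese_results = [(10, 7), (10, 5)]

english_score = mean_pass_at_k(english_results, k=1)
chinese_score = mean_pass_at_k(chinese_results, k=1)
absolute_gap, relative_gap = performance_gap(english_score, chinese_score)
print(f"absolute gap: {absolute_gap:.3f}, relative gap: {relative_gap:.3f}")
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, so Pass@1 on a benchmark is simply the mean per-problem pass rate.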
