Large language models (LLMs) have made significant progress in generating code from textual prompts. However, existing benchmarks either focus on translating English prompts into code in multiple programming languages or cover only a handful of natural languages (NLs). They overlook the broader setting of massively multilingual NL to multilingual code generation, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark crafted specifically to address this gap. HumanEval-XL connects 23 NLs with 12 programming languages (PLs) and comprises 22,080 prompts with an average of 8.33 test cases each. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive platform for evaluating multilingual LLMs and for assessing how well they understand different NLs. Our work is a pioneering step toward filling the void in evaluating NL generalization in multilingual code generation. We make our evaluation code and data publicly available at \url{https://github.com/FloatAI/humaneval-xl}.