We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion and discover the generalization ability of language models on out-of-domain languages, the advantages of multi-lingual models over mono-lingual ones, the ability of few-shot prompting to teach the model new languages, and zero-shot translation abilities even in mono-lingual settings. Furthermore, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks. Overall, our benchmarks represent a significant step towards a deeper understanding of language models' code generation abilities. We publicly release our code and datasets at https://github.com/amazon-research/mxeval.
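To make the idea of transpiling test cases concrete, the following is a minimal sketch of how a single Python assertion might be converted into a JavaScript check. It is an illustration under stated assumptions, not the conversion framework released in mxeval: the helper names `to_js_literal` and `convert_assert_to_js` are hypothetical, only assertions of the form `assert f(args) == expected` with JSON-compatible literal values are handled, and comparing via `JSON.stringify` is a simplification chosen here for brevity.

```python
import ast
import json


def to_js_literal(node: ast.AST) -> str:
    """Render a Python literal AST node as a JavaScript literal.

    Assumption: the value is JSON-compatible (numbers, strings, booleans,
    None, lists, tuples, dicts). Sets and other types are not supported.
    """
    value = ast.literal_eval(node)
    return json.dumps(value)


def convert_assert_to_js(py_assert: str) -> str:
    """Translate `assert f(args) == expected` into a JavaScript statement.

    Hypothetical helper for illustration only; it keeps the Python
    function name unchanged and ignores keyword arguments.
    """
    stmt = ast.parse(py_assert).body[0]
    if not isinstance(stmt, ast.Assert) or not isinstance(stmt.test, ast.Compare):
        raise ValueError("expected: assert f(...) == value")

    compare = stmt.test
    call = compare.left
    is_simple_equality = (
        isinstance(call, ast.Call)
        and isinstance(call.func, ast.Name)
        and len(compare.ops) == 1
        and isinstance(compare.ops[0], ast.Eq)
    )
    if not is_simple_equality:
        raise ValueError("expected: assert f(...) == value")

    args = ", ".join(to_js_literal(arg) for arg in call.args)
    expected = to_js_literal(compare.comparators[0])
    # Compare serialized values so that arrays are checked structurally.
    return (
        f"console.assert(JSON.stringify({call.func.id}({args})) "
        f"=== JSON.stringify({expected}));"
    )


if __name__ == "__main__":
    print(convert_assert_to_js("assert first_n([1, 2, 3], 2) == [1, 2]"))
```

Working from the parsed syntax tree rather than from raw strings keeps the argument values and the expected value separate, so each can be rendered in the target language's own literal syntax. A real framework would also need to handle the function signature and docstring in the prompt, as well as language-specific typing.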