Large Code Generation Models (LCGMs) have garnered significant attention and achieved promising results across various programming tasks. However, concerns remain about their performance on non-English prompts, as these models are trained primarily on English-centric corpora and most programming-language tokens resemble English. Existing benchmarks typically rely on English programming questions and a limited number of manually written unit test cases, which makes them inadequate for assessing the quality of LCGM-generated code. This paper investigates differences in generated code quality, specifically effectiveness and efficiency, when different natural languages are used as input, focusing on Chinese and English because of their prominent corpora and the availability of LCGMs supporting them. Evaluating the quality of LCGM-generated code under bilingual inputs presents three challenges: (1) the lack of high-quality bilingual programming question datasets, (2) insufficient unit test cases for comprehensive correctness verification, and (3) limited support for comparing the performance of generated code. To address these challenges, we curated a test suite of 52 bilingual programming questions and developed an automated input generator for each. We strengthened correctness verification by sampling larger sets of unit test cases, and we estimated code performance by profiling execution time as input size grows. Using this framework, we conducted an empirical study on six state-of-the-art LCGMs. The results reveal that LCGM-generated code exhibits bilingual correctness differences on an average of 10.5% of tasks, and that 39.5% of the correct code shows bilingual performance differences of varying degrees. Our findings suggest that LCGMs may not generate consistently high-quality code across languages, pointing to directions for future research.
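To make the evaluation pipeline concrete, the sketch below illustrates the two ingredients the abstract describes: correctness verification via sampled unit test cases checked against a reference oracle, and efficiency estimation via profiling execution time across growing input sizes. This is a minimal illustrative sketch, not the paper's actual harness; the names `input_generator`, `candidate_solution`, and `reference_solution` are hypothetical placeholders standing in for a per-question generator, LCGM-generated code, and a trusted solution.

```python
import random
import statistics
import time

# Hypothetical per-question input generator (the paper builds one per
# question); here it emits a random integer list of size n.
def input_generator(n: int) -> list[int]:
    return [random.randint(0, 10**6) for _ in range(n)]

# Stand-ins for LCGM-generated code and a trusted reference solution.
def candidate_solution(xs: list[int]) -> list[int]:
    return sorted(xs)

def reference_solution(xs: list[int]) -> list[int]:
    return sorted(xs)

def check_correctness(n_cases: int = 100, size: int = 1_000) -> bool:
    """Correctness via sampled unit tests: compare the candidate
    against the reference oracle on many generated inputs."""
    for _ in range(n_cases):
        data = input_generator(size)
        if candidate_solution(list(data)) != reference_solution(list(data)):
            return False
    return True

def profile_growth(sizes=(1_000, 10_000, 100_000), repeats=5) -> dict[int, float]:
    """Efficiency via profiling: median wall-clock time per input size,
    so runtime growth can be compared across prompt languages."""
    results = {}
    for n in sizes:
        samples = []
        for _ in range(repeats):
            data = input_generator(n)
            start = time.perf_counter()
            candidate_solution(data)
            samples.append(time.perf_counter() - start)
        results[n] = statistics.median(samples)
    return results

if __name__ == "__main__":
    print("correct:", check_correctness())
    for n, t in profile_growth().items():
        print(f"n={n:>7}: median {t * 1e3:.2f} ms")
```

Under this scheme, two generated programs that both pass the sampled tests can still be distinguished by how their median runtimes scale from the smallest to the largest input size, which is the kind of bilingual performance difference the study measures.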