Large Language Models (LLMs) have achieved state-of-the-art performance across software engineering tasks, from code generation to code translation. However, we identify and systematically evaluate a critical failure mode: Programming Language Confusion (PLC) -- the generation of code in an unintended language despite explicit instructions. Through an evaluation of 10 popular LLMs across six multilingual datasets (LiveCodeBench, BabelCode variants, HumanEval-XL, and McEval), we demonstrate that PLC is pervasive, with some code-specialized models exhibiting the highest confusion rates. Our analysis reveals that PLC is not random noise but reflects systematic patterns: models consistently generate syntactically valid code even when it deviates from the requested language. This behavior produces distinct language-migration patterns, most notably a strong default to Python and systematic shifts between syntactically similar language pairs (e.g., C#/Java). These migrations reflect statistical preferences learned from training data rather than goal-directed reasoning. We demonstrate that explicit language keywords provide the most effective mitigation, while natural language instructions have limited influence on model behavior. Furthermore, model quantization -- though essential for practical deployment -- significantly amplifies PLC and degrades syntactic stability on complex tasks. Our findings underscore that language fidelity should be treated as a core evaluation dimension for code LLMs. We advocate for standardized benchmarks and prompt formats with explicit language constraints to enable more reliable assessment and foster the development of robust multilingual code generation systems.
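To make the failure mode concrete, the check below is a minimal, hypothetical sketch of how one might flag a PLC instance: classify the generated snippet with a naive keyword heuristic and compare it to the requested language. The marker sets and function names (`LANGUAGE_MARKERS`, `detect_language`, `is_confused`) are illustrative assumptions, not the detector used in the paper.

```python
# Hypothetical PLC check: does generated code match the requested language?
# The marker lists are deliberately crude and for illustration only.
LANGUAGE_MARKERS = {
    "python": ("def ", "import ", ":\n"),
    "java": ("public class", "System.out", "void main"),
    "c#": ("namespace ", "Console.Write", "using System"),
}

def detect_language(code: str) -> str:
    """Return the language whose markers appear most often (naive heuristic)."""
    scores = {lang: sum(marker in code for marker in markers)
              for lang, markers in LANGUAGE_MARKERS.items()}
    return max(scores, key=scores.get)

def is_confused(requested: str, code: str) -> bool:
    """Flag PLC: the snippet looks like a language other than the one requested."""
    return detect_language(code) != requested.lower()

# A model asked for Java that defaults to Python would be flagged here.
snippet = "def add(a, b):\n    return a + b\n"
print(is_confused("Java", snippet))   # True: the snippet is Python-like
```

A production detector would need a real parser or lexer per target language rather than string markers, but even this sketch illustrates why confusion between syntactically similar pairs (e.g., C#/Java) is harder to catch than a wholesale shift to Python.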