This study presents a comprehensive empirical evaluation of six state-of-the-art large language models (LLMs) for code generation, including both general-purpose and code-specialized models. Using a dataset of 944 real-world LeetCode problems across five programming languages, we assess model performance with rigorous metrics: compile-time errors, runtime errors, functional failures, and algorithmic suboptimality. The results reveal significant performance variations, with DeepSeek-R1 and GPT-4.1 consistently outperforming the other models in correctness, efficiency, and robustness. Through detailed case studies, we identify common failure scenarios, such as syntax errors, logical flaws, and suboptimal algorithms, and highlight the critical role of prompt engineering and human oversight in improving results. Based on these findings, we provide actionable recommendations for developers and practitioners, emphasizing that successful LLM deployment depends on careful model selection, effective prompt design, and context-aware usage to ensure reliable code generation in real-world software development tasks.