This study presents a comprehensive empirical evaluation of six state-of-the-art large language models (LLMs) for code generation, including both general-purpose and code-specialized models. Using a dataset of 944 real-world LeetCode problems across five programming languages, we assess model performance with rigorous metrics: compile-time errors, runtime errors, functional failures, and algorithmic suboptimality. The results reveal significant performance variations, with DeepSeek-R1 and GPT-4.1 consistently outperforming the other models in correctness, efficiency, and robustness. Through detailed case studies, we identify common failure scenarios, such as syntax errors, logical flaws, and suboptimal algorithms, and highlight the critical role of prompt engineering and human oversight in improving results. Based on these findings, we provide actionable recommendations for developers and practitioners, emphasizing that successful LLM deployment depends on careful model selection, effective prompt design, and context-aware usage to ensure reliable code generation in real-world software development tasks.