We systematically evaluated the performance of seven large language models in generating code across a range of prompt strategies, programming languages, and task difficulties. GPT-4 substantially outperforms the other large language models, including Gemini Ultra and Claude 2. The coding performance of GPT-4 varies considerably with the prompt strategy used. In most of the LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4 with the optimal prompt strategy outperforms 85 percent of human participants. Additionally, GPT-4 demonstrates strong capabilities in translating code between programming languages and in learning from past errors. The computational efficiency of the code generated by GPT-4 is comparable to that of code written by human programmers. These results suggest that GPT-4 has the potential to serve as a reliable assistant in code generation and software development.