We systematically evaluated the performance of seven large language models in generating code across various prompt strategies, programming languages, and task difficulties. GPT-4 substantially outperforms the other large language models, including Gemini Ultra and Claude 2. The coding performance of GPT-4 varies considerably across prompt strategies. In most of the LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4 with the optimal prompt strategy outperforms 85 percent of human participants. GPT-4 also demonstrates strong capabilities in translating code between programming languages and in learning from past errors. The computational efficiency of the code generated by GPT-4 is comparable to that of code written by human programmers. These results suggest that GPT-4 has the potential to serve as a reliable assistant in code generation and software development.