Code generation models have become integral to software development, assisting with tasks such as code completion, debugging, and code translation. Although existing research has thoroughly examined the correctness of the code these models produce, a vital aspect, the efficiency of the generated code, has often been neglected. This paper presents EffiBench, a benchmark of 1,000 efficiency-critical coding problems for assessing the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems, each paired with an executable human-written canonical solution. With EffiBench, we empirically examine the capability of 21 large language models (13 open-source and 8 closed-source) to generate efficient code. The results demonstrate that GPT-4-turbo generates the most efficient code, significantly outperforming Palm-2-chat-bison, Claude-instant-1, Gemini-pro, GPT-4, and GPT-3.5. Nevertheless, its code is still less efficient than the human-written canonical solutions: the average and worst-case execution times of GPT-4-turbo generated code are 1.69 and 45.49 times those of the canonical solutions, respectively.
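The comparison against canonical solutions can be read as a per-problem ratio of execution times, summarized by its mean and its maximum. The following is a minimal sketch under that assumption; the function names and the timings are hypothetical and are not taken from EffiBench, whose exact metric definition may differ.

```python
from statistics import mean


def normalized_times(generated, canonical):
    """Return the per-problem ratio of generated-code time to canonical time.

    Both arguments are sequences of execution times in seconds, aligned by
    problem. This pairing is an assumption made for illustration.
    """
    if len(generated) != len(canonical):
        raise ValueError("timing lists must cover the same problems")
    return [g / c for g, c in zip(generated, canonical)]


def summarize(generated, canonical):
    """Return (average ratio, worst-case ratio) across all problems."""
    ratios = normalized_times(generated, canonical)
    return mean(ratios), max(ratios)


# Made-up timings for three problems, for illustration only.
generated = [0.5, 1.5, 0.25]
canonical = [0.25, 0.5, 0.25]

avg_ratio, worst_ratio = summarize(generated, canonical)
print(f"average: {avg_ratio:.2f}x, worst: {worst_ratio:.2f}x")
```

A ratio above 1 means the generated code ran slower than the canonical solution on that problem; under this reading, the reported 1.69 and 45.49 would correspond to the average and worst-case values.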