Code generation models have become increasingly integral to software development. Although current research has thoroughly examined the correctness of the code these models produce, code efficiency, a property vital to green computing and sustainability, has often been neglected. This paper presents EffiBench, a benchmark of 1,000 efficiency-critical coding problems for assessing the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems, each paired with an executable human-written canonical solution that achieves state-of-the-art efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 large language models (35 open-source and 7 closed-source) to generate efficient code. Our evaluation shows that the code generated by LLMs is generally less efficient than the human-written canonical solutions. For example, the average execution time of GPT-4-generated code is \textbf{3.12} times that of the canonical solutions; in the most extreme cases, the execution time and total memory usage of GPT-4-generated code are \textbf{13.89} and \textbf{43.92} times those of the canonical solutions, respectively. The source code of EffiBench is released at https://github.com/huangd1999/EffiBench, and a leaderboard is available at https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
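The reported ratios (execution time and memory of generated code relative to a canonical solution) can be sketched with standard Python profiling tools. This is a minimal illustration of the metric, not EffiBench's actual harness; the toy problem and both solutions are hypothetical.

```python
import time
import tracemalloc

def profile(func, *args):
    """Return (wall-clock time in seconds, peak traced memory in bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

# Hypothetical pair of solutions to the same problem: sum of 1..n.
def canonical(n):
    # Closed-form, O(1) time and memory.
    return n * (n + 1) // 2

def generated(n):
    # Naive O(n) loop, standing in for a less efficient model-generated solution.
    return sum(range(1, n + 1))

t_canon, _ = profile(canonical, 10**6)
t_gen, _ = profile(generated, 10**6)
print(f"execution-time ratio (generated / canonical): {t_gen / t_canon:.1f}x")
```

In practice a benchmark would average such ratios over many test inputs and repeated runs to reduce timing noise.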