The emergence of large language models (LLMs) has significantly pushed the frontiers of program synthesis. Advancing LLM-based program synthesis calls for a thorough evaluation of LLM-generated code. Most evaluation frameworks focus on the (functional) correctness of generated code; efficiency, as an important measure of code quality, has been overlooked in existing evaluations. In this work, we develop ENAMEL (EfficieNcy AutoMatic EvaLuator), a rigorous and high-standard benchmark for evaluating the capability of LLMs in generating efficient code. First, we propose a new efficiency metric called eff@k, which generalizes the pass@k metric from correctness to efficiency and appropriately handles right-censored execution time. Furthermore, we derive an unbiased and variance-reduced estimator of eff@k via Rao-Blackwellization; we also provide a numerically stable implementation of the new estimator. Second, to set a high standard for efficiency evaluation, we employ a human expert to design the best algorithms and implementations as our efficiency reference solutions, many of which are much more efficient than the existing canonical solutions in HumanEval and HumanEval+. Moreover, to ensure a rigorous evaluation, we employ a human expert to curate strong test case generators that filter out wrong code and differentiate suboptimal algorithms. An extensive study across 30 popular LLMs using our benchmark ENAMEL shows that LLMs still fall short of generating expert-level efficient code. Using two subsets of our problem set, we demonstrate that this deficiency arises because current LLMs struggle to design advanced algorithms and are barely aware of implementation optimization. Our benchmark is publicly available at https://github.com/q-rz/enamel.
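To make the connection between pass@k and eff@k concrete, the following is a minimal sketch, not the paper's reference implementation. It assumes that eff@k is defined as the expected maximum per-sample efficiency score over k of the n generated samples drawn without replacement (with a score of 0 for incorrect or timed-out code), which directly generalizes pass@k: with binary 0/1 scores, the estimator reduces to the standard unbiased pass@k formula. The function name `eff_at_k` and the score convention are illustrative assumptions.

```python
from math import comb

def eff_at_k(scores, k):
    """Sketch of a Rao-Blackwellized eff@k estimator (an assumption, not
    the paper's exact code): the expected maximum efficiency score among
    k of the n samples, averaged over all k-subsets without replacement.

    scores: one efficiency score per generated sample (0 for wrong code).
    """
    e = sorted(scores)            # ascending: e[0] <= ... <= e[n-1]
    n = len(e)
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    # Weight of the i-th order statistic being the subset maximum:
    # w_i = C(i-1, k-1) / C(n, k); start at i = k, where w_k = 1 / C(n, k).
    w = 1.0 / comb(n, k)
    total = 0.0
    for i in range(k, n + 1):     # only the top n-k+1 scores can be a max
        total += w * e[i - 1]
        w *= i / (i - k + 1)      # recurrence: C(i, k-1)/C(i-1, k-1)
    return total
```

For example, `eff_at_k([0, 0, 0, 1, 1], 2)` gives 0.7, matching the unbiased pass@k estimator 1 - C(3,2)/C(5,2) for n = 5 samples with c = 2 correct; averaging over all k-subsets (rather than over one random subset) is what reduces the estimator's variance via Rao-Blackwellization.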