Despite recent strides in evaluating Large Language Models for Code (Code-LLMs), existing benchmarks have focused mainly on functional correctness and overlooked computational efficiency. To fill this gap, we present Mercury, the first computational efficiency benchmark for Code-LLMs. It comprises 1,889 Python tasks, each with enough reference solutions to support a runtime distribution. Based on this distribution, we introduce a new metric, Beyond, which computes a runtime-percentile-weighted Pass score to reflect functional correctness and computational efficiency simultaneously. On Mercury, leading Code-LLMs achieve 67% on Pass but less than 50% on Beyond. Since an ideal Beyond score would match the Pass score, this result indicates that while Code-LLMs are impressively capable of generating functionally correct code, a notable gap remains in the efficiency of that code. Finally, our experiments show that Direct Preference Optimization (DPO) serves as a more robust baseline for enhancing computational efficiency than Supervised Fine-Tuning (SFT), which opens a promising avenue for future work on efficient code generation.
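To make the idea of a runtime-percentile-weighted Pass score concrete, the following is a minimal sketch rather than the benchmark's actual formula, which the abstract does not spell out. It assumes that a failing solution scores 0, that a passing solution scores the fraction of reference runtimes it strictly beats, and that scores are averaged over tasks. The function names `runtime_percentile`, `pass_score`, and `beyond_score`, and the dictionary layout of the inputs, are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, Optional, Sequence


def runtime_percentile(runtime: float, reference_runtimes: Sequence[float]) -> float:
    """Fraction of reference runtimes that are strictly slower than `runtime`."""
    ordered = sorted(reference_runtimes)
    # bisect_right counts the references that are as fast as or faster than
    # `runtime`; the remainder are strictly slower.
    slower = len(ordered) - bisect_right(ordered, runtime)
    return slower / len(ordered)


def pass_score(results: Dict[str, Optional[float]]) -> float:
    """Share of tasks solved correctly; a runtime of None marks a failure."""
    passed = sum(1 for runtime in results.values() if runtime is not None)
    return passed / len(results)


def beyond_score(
    results: Dict[str, Optional[float]],
    references: Dict[str, Sequence[float]],
) -> float:
    """Assumed percentile-weighted Pass: failures add 0, passes add their percentile."""
    total = 0.0
    for task_id, runtime in results.items():
        if runtime is None:
            continue  # functionally incorrect solutions contribute nothing
        total += runtime_percentile(runtime, references[task_id])
    return total / len(results)
```

Under these assumptions, Beyond can never exceed Pass, and the two coincide only when every passing solution is faster than all reference solutions for its task. This matches the abstract's reading that the distance between the two scores reflects lost efficiency.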