We introduce Differential Performance Evaluation (DPE), a framework designed to reliably evaluate Large Language Models (LLMs) for efficient code generation. Traditional coding benchmarks often fail to provide reliable insights into code efficiency because they rely on simplistic test inputs and lack effective compound metrics. DPE addresses these issues by focusing on efficiency-demanding programming tasks and establishing an insightful compound metric for performance evaluation. DPE operates in two phases: to curate efficiency datasets, it selects efficiency-demanding tasks from existing coding benchmarks and generates computationally expensive inputs that stress the efficiency of LLM solutions; to assess code efficiency, DPE profiles a new solution and compares it globally against a set of reference solutions exhibiting distinct efficiency levels, where the matched level defines its efficiency score. As a proof of concept, we use DPE to create EvalPerf, a benchmark of 121 performance-challenging coding tasks. Our comprehensive evaluation yields interesting findings on the efficiency impact of model size, instruction tuning, and prompting. For example, while the scaling law fails to account for code efficiency, general instruction tuning benefits both code correctness and efficiency. We also evaluate the evaluation itself, examining the effectiveness of DPE and showing that EvalPerf is reliable and convenient to use even across platforms.
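The second phase described above, matching a new solution's performance profile against reference solutions of distinct efficiency levels, can be illustrated with a minimal sketch. This is an assumption-laden simplification, not the paper's actual scoring formula: here `new_cost` and `reference_costs` are hypothetical profiled costs (e.g., instruction counts), and the score is simply the fraction of reference levels the new solution matches or beats.

```python
def efficiency_score(new_cost: float, reference_costs: list[float]) -> float:
    """Sketch of level-matching efficiency scoring (hypothetical simplification).

    `reference_costs` holds profiled costs of reference solutions, ordered
    from slowest (highest cost, level 1) to fastest (lowest cost, highest
    level). The new solution earns the highest level whose reference it is
    at least as fast as; the score is that level normalized to [0, 1].
    """
    matched_level = 0
    for level, ref_cost in enumerate(reference_costs, start=1):
        if new_cost <= ref_cost:  # at least as fast as this reference level
            matched_level = level
    return matched_level / len(reference_costs)
```

For instance, a solution with cost 120 measured against reference costs `[300, 200, 100]` matches levels 1 and 2 but not 3, scoring 2/3. Profiling by instruction count rather than wall-clock time is what makes such comparisons stable across platforms.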