Large Language Models (LLMs) are widely adopted for assisting in software development tasks, yet their performance evaluations have narrowly focused on the functional correctness of generated code. Human programmers, however, require LLM-generated code to be not only correct but also efficient. We propose PerfCodeGen, a training-free framework that enhances the performance of LLM-generated code by incorporating runtime feedback from test case execution into the self-refinement iterations. With PerfCodeGen, we achieve speedups for a significantly higher proportion of problems than the base LLM with sophisticated prompting techniques. Applied to open language models like Phi-3-mini, PerfCodeGen achieves runtime efficiency comparable to prompting powerful closed models like GPT-4. With GPT-3.5 and GPT-4, PerfCodeGen achieves state-of-the-art runtime efficiency on benchmarks such as HumanEval, MBPP, and APPS, frequently surpassing the ground-truth reference solutions. Additionally, we demonstrate the effectiveness of our approach in enhancing code quality across a range of open LLMs of varying sizes, including Phi-3-mini, Llama 3 8B, Mixtral 8x7B, Command R, and Llama 3 70B.
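The core idea of feeding execution-based runtime feedback into a self-refinement loop can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: `generate` stands in for any LLM call, candidate solutions are assumed to expose a `solve` function, and the feedback message format is invented for illustration.

```python
import time

def run_tests(code: str, tests: list) -> tuple:
    """Execute candidate code against unit tests, returning
    (all_passed, total_runtime_seconds)."""
    namespace = {}
    exec(code, namespace)  # assumes the candidate defines solve()
    solve = namespace["solve"]
    start = time.perf_counter()
    passed = all(solve(*args) == expected for args, expected in tests)
    return passed, time.perf_counter() - start

def refine(generate, problem: str, tests: list, rounds: int = 3) -> str:
    """Iteratively ask the model to speed up its own solution,
    feeding back correctness and measured runtime from test execution."""
    best_code = generate(problem)
    ok, best_time = run_tests(best_code, tests)
    for _ in range(rounds):
        # Runtime feedback (hypothetical wording) is appended to the prompt.
        feedback = (f"Your solution passed={ok} and took {best_time:.4f}s "
                    f"on the test suite. Produce a faster correct version.")
        candidate = generate(problem + "\n" + best_code + "\n" + feedback)
        ok_new, t_new = run_tests(candidate, tests)
        if ok_new and t_new < best_time:  # keep only strict improvements
            best_code, best_time, ok = candidate, t_new, ok_new
    return best_code
```

A key design point reflected here is that a candidate is only accepted if it both passes the tests and measurably improves runtime, so refinement never trades correctness for speed.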