Although large language models (LLMs) have been largely successful in generating functionally correct programs, conditioning models to produce efficient solutions while ensuring correctness remains a challenge. Moreover, benchmarking code efficiency reliably across varying hardware specifications is a hurdle, particularly for popular interpreted languages such as Python. In this paper, we present ECCO, a reproducible benchmark for evaluating program efficiency via two paradigms: natural-language (NL) based code generation and history-based code editing. On ECCO, we adapt and thoroughly investigate the three most promising classes of existing LLM-based approaches: in-context learning, iterative refinement with execution or NL feedback, and fine-tuning conditioned on execution and editing history. While most methods degrade functional correctness while only moderately improving program efficiency, we find that adding execution information often helps maintain functional correctness, whereas NL feedback yields greater efficiency gains. We release our benchmark to support future work on LLM-based generation of efficient code.