Large Language Models (LLMs) can generate code, but can they generate fast code? In this paper, we study this question using a dataset of 65 real-world tasks mined from open-source Java programs. We specifically select tasks for which developers achieved significant speedups, and employ an automated pipeline to generate patches for these tasks using two leading LLMs under four prompt variations. By rigorously benchmarking the results against both the baseline and the human-authored solutions, we demonstrate that LLM-generated code indeed improves performance over the baseline in most cases. However, patches proposed by human developers outperform LLM-generated patches by a statistically significant margin, indicating that LLMs often fall short of finding truly optimal solutions. We further find that LLM solutions are semantically identical or similar to the developer's optimization idea in approximately two-thirds of cases, whereas they propose a more original idea in the remaining one-third. However, these original ideas only occasionally yield substantial performance gains.