Large Language Models (LLMs) can generate code, but can they generate fast code for complex, real-world software systems? In this study, we investigate this question using a dataset of 65 tasks mined from performance-critical open-source Java projects. Unlike prior studies, which focused on algorithmic puzzles, we conduct experiments on actual performance-sensitive production code and use developer-written JMH benchmarks to rigorously validate performance gains against human baselines. Our results reveal a nuanced reality: although LLMs demonstrate a surprisingly high capability to solve these complex engineering problems, their solutions are extremely volatile and still lag behind those of human developers on average. Consequently, we find that current benchmarks based on algorithmic tasks yield an overly optimistic assessment of LLM capabilities. We trace this real-world performance gap to two primary limitations: first, LLMs struggle to autonomously pinpoint performance hotspots, and second, even with explicit guidance, they often fall short of synthesizing optimal algorithmic improvements. Our results highlight the need to move beyond static code generation towards more complex agent-based systems that can profile code and observe runtime behavior to improve performance.