The integration of LLMs into software engineering has catalyzed a paradigm shift from traditional rule-based systems to sophisticated agentic systems capable of autonomous problem-solving. Despite this transformation, the field lacks a comprehensive understanding of how benchmarks and solutions interconnect, hindering systematic progress and evaluation. This survey presents the first holistic analysis of LLM-empowered software engineering, bridging the critical gap between evaluation and solution approaches. We analyze 150+ recent papers and organize them into a comprehensive taxonomy spanning two major dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, covering code generation, translation, repair, and other tasks. Our analysis reveals how the field has evolved from simple prompt engineering to complex agentic systems incorporating planning and decomposition, reasoning and self-refinement, memory mechanisms, and tool augmentation. We present a unified pipeline that illustrates the complete workflow from task specification to final deliverables, demonstrating how different solution paradigms address varying complexity levels across software engineering tasks. Unlike existing surveys that focus on isolated aspects, we provide full-spectrum coverage connecting 50+ benchmarks with their corresponding solution strategies, enabling researchers to identify optimal approaches for specific evaluation criteria. Furthermore, we identify critical research gaps and propose actionable future directions, including multi-agent collaboration frameworks, self-evolving code generation systems, and integration of formal verification with LLM-based methods. This survey serves as a foundational resource for researchers and practitioners seeking to understand, evaluate, and advance LLM-empowered software engineering systems.