A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is hindered by a lack of comprehensive understanding of how benchmarks and solutions interconnect. This survey addresses this gap by providing the first holistic analysis of LLM-powered software engineering, offering insights into evaluation methodologies and solution paradigms. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair. Our analysis highlights the evolution from simple prompt engineering to sophisticated agentic systems incorporating capabilities like planning, reasoning, memory mechanisms, and tool augmentation. To contextualize this progress, we present a unified pipeline illustrating the workflow from task specification to deliverables, detailing how different solution paradigms address various complexity levels. Unlike prior surveys that focus narrowly on specific aspects, this work connects 50+ benchmarks to their corresponding solution strategies, enabling researchers to identify optimal approaches for diverse evaluation criteria. We also identify critical research gaps and propose future directions, including multi-agent collaboration, self-evolving systems, and formal verification integration. This survey serves as a foundational guide for advancing LLM-driven software engineering. We maintain a GitHub repository that continuously updates the reviewed and related papers at https://github.com/lisaGuojl/LLM-Agent-SE-Survey.