Recently, LLM agents have made rapid progress in their programming capabilities. However, existing benchmarks can neither automatically evaluate generated code from the user's perspective nor explain the results of LLM agents' code generation. We therefore introduce ProjectEval, a new benchmark that automatically evaluates LLM agents on project-level code generation by simulating user interaction. ProjectEval is constructed by an LLM with human review and provides inputs at three levels, ranging from natural language descriptions to code skeletons. Generated projects are evaluated both by execution through simulated user interaction and by code similarity using established objective metrics. Through ProjectEval, we find that systematic engineering of project code, an overall understanding of the project, and comprehensive analysis capability are the keys for LLM agents to complete practical projects. Our findings and benchmark provide valuable insights for developing more effective programming agents that can be deployed in future real-world production.