Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs). However, existing benchmarks for code agent evaluation face two major limitations. First, creating high-quality project-level evaluation datasets requires extensive domain expertise, leading to prohibitive annotation costs and limited diversity. Second, while recent Agent-as-a-Judge paradigms address the rigidity of traditional unit tests by enabling flexible metrics, their reliance on In-Context Learning (ICL) with general LLMs often results in inaccurate assessments that misalign with human standards. To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse project-level tasks. Using this pipeline, we introduce PRDBench, comprising 50 real-world Python projects across 20 domains, each with structured Product Requirement Documents (PRDs) and comprehensive evaluation criteria. Furthermore, to overcome the inaccuracy of general LLM judges, we propose a highly reliable evaluation framework powered by a specialized, fine-tuned model. Built on Qwen3-Coder-30B, our dedicated PRDJudge achieves over 90% alignment with human judgments in fixed-interface scenarios. Extensive experiments demonstrate that our benchmark and evaluation suite provides a scalable, robust, and highly accurate framework for assessing state-of-the-art code agents.