EvoCodeBench: A Human-Performance Benchmark for Self-Evolving LLM-Driven Coding Systems

As large language models (LLMs) continue to advance in programming tasks, LLM-driven coding systems have evolved from one-shot code generators into complex systems capable of iterative improvement during inference. However, existing code benchmarks primarily emphasize static correctness and implicitly assume that model capability is fixed during inference. As a result, they do not capture inference-time self-evolution, such as whether accuracy and efficiency improve as an agent iteratively refines its solutions. They also provide limited accounting of resource costs and rarely calibrate model performance against that of human programmers. Moreover, many benchmarks are dominated by high-resource languages, leaving cross-language robustness and long-tail language stability underexplored. To address these gaps, we present EvoCodeBench, a benchmark for evaluating self-evolving LLM-driven coding systems across programming languages with direct comparison to human performance. EvoCodeBench tracks performance dynamics over repeated problem-solving attempts, measuring solution correctness alongside efficiency metrics such as solving time, memory consumption, and improvements in algorithmic design. To ground evaluation in a human-centered reference frame, we compare model performance with that of human programmers on the same tasks, which places each system within the distribution of human ability. Furthermore, EvoCodeBench supports multiple programming languages, enabling systematic cross-language and long-tail stability analyses under a unified protocol. Our results demonstrate that self-evolving systems exhibit measurable gains in efficiency over time, and that human-relative and multi-language analyses provide insights unavailable through accuracy alone. EvoCodeBench establishes a foundation for evaluating coding intelligence in evolving LLM-driven systems.
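
The human-relative comparison described above can be illustrated with a minimal sketch. It is not EvoCodeBench's actual scoring procedure, which the abstract does not specify. It assumes each task comes with a list of human runtimes (lower is better) and scores a model attempt by the percentage of human submissions it beats, counting ties as half. The names `percentile_beaten` and `evolution_trace` are hypothetical and introduced only for illustration.

```python
from bisect import bisect_left, bisect_right


def percentile_beaten(human_values, model_value):
    """Percentage of human submissions that the model's value beats.

    Assumes lower is better (e.g. runtime or memory). Ties count as half.
    This scoring rule is an illustrative assumption, not the benchmark's own.
    """
    ordered = sorted(human_values)
    if not ordered:
        raise ValueError("need at least one human measurement")
    lo = bisect_left(ordered, model_value)   # humans strictly better than the model
    hi = bisect_right(ordered, model_value)  # humans better than or tied with it
    strictly_worse = len(ordered) - hi
    ties = hi - lo
    return 100.0 * (strictly_worse + 0.5 * ties) / len(ordered)


def evolution_trace(human_values, attempts):
    """Human-relative score for each successive attempt on the same task."""
    return [percentile_beaten(human_values, value) for value in attempts]


if __name__ == "__main__":
    # Made-up numbers purely for demonstration: human runtimes in ms,
    # followed by a model's runtimes over three refinement rounds.
    humans = [10, 20, 30, 40, 50]
    print(evolution_trace(humans, [45, 30, 15]))
```

A trace that rises across attempts would indicate that the system is moving up within the human distribution as it refines its solution, which is the kind of dynamic a single accuracy number cannot show.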
