Current evaluations of Large Language Model (LLM) code agents predominantly focus on generating functional code in single-turn scenarios, which fails to assess an agent's capability for continuous code optimization and multi-turn iterative development. To bridge this gap, we introduce CATArena, a framework designed to evaluate the evolutionary capabilities of code agents via iterative tournaments. Agents engage in multi-turn tournaments and continuously refine their code through self-reflection and peer learning based on comprehensive execution feedback. For evaluation, we propose a dual-metric system that decouples static generation proficiency from evolutionary potential. Extensive experiments reveal that an agent's evolutionary potential is not strictly correlated with its initial proficiency. Our analysis further shows that current agents struggle to leverage peer learning and self-reflection concurrently for effective performance gains. Furthermore, the results validate CATArena's high extensibility and robustness to task variants, establishing it as a continuous and reliable standard for assessing the evolutionary capability of LLM code agents.