Large language models (LLMs) have revolutionized automated code generation, yet the evaluation of their real-world effectiveness remains limited by static benchmarks and simplistic metrics. We present ProxyWar, a novel framework that systematically assesses code generation quality by embedding LLM-generated agents within diverse, competitive game environments. Unlike existing approaches, ProxyWar evaluates not only functional correctness but also the operational characteristics of generated programs, combining automated testing, iterative code repair, and multi-agent tournaments to provide a holistic view of program behavior. Applied to a range of state-of-the-art code-generation models and games, our approach uncovers notable discrepancies between benchmark scores and actual performance in dynamic settings, revealing overlooked limitations and opportunities for improvement. These findings highlight the need for richer, competition-based evaluation of code generation. Looking forward, ProxyWar lays a foundation for research into LLM-driven algorithm discovery, adaptive problem solving, and the study of practical efficiency and robustness, including the potential for models to outperform hand-crafted agents. The project is available at https://github.com/xinke-wang/ProxyWar.