ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced empirical understanding of how framework choice affects agent performance. We propose \textbf{LLM-as-a-Developer}, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop until tests pass. Because the developer is held constant and only the framework varies, generation effort becomes a quantitative proxy for API usability, and the resulting agents provide a controlled measure of framework effectiveness. We implement this in \textbf{ADK Arena}, a fully automated pipeline with per-framework Docker isolation, three-level validation, and benchmark adapters for SWE-bench, $\tau^2$-bench, Terminal-Bench, and MCP-Atlas. Evaluating 51 popular Python ADK frameworks (204 agent--benchmark pairs), we find that: (1)~generation succeeds for 57\% of runs, and its cost varies 5.6$\times$ across frameworks (\$0.6 to \$3.4 per agent), a quantitative proxy for API complexity, though cost alone does not predict success; (2)~no single framework dominates: the best single-benchmark ADK agents resolve up to 80\% of tasks and can even \emph{beat} general-purpose frontier coding agents at a fraction of the cost, yet the median framework resolves only 32\%; (3)~across information-source ablations, genuine framework usage stays within a narrow 28--40\% band (highest with raw source access and still 33\% with no reference material at all), indicating that documentation, source code, and parametric knowledge are largely substitutable and that no single one is a hard bottleneck.
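The validate-and-feedback loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not ADK Arena's actual implementation: the names `generate_with_repair`, `GenerationResult`, `write_code`, `validate`, and the `max_rounds` budget are all hypothetical, and the stub developer and validator merely stand in for the LLM coding agent and the test suite.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class GenerationResult:
    """Outcome of one generate-and-repair run (hypothetical structure)."""
    success: bool
    rounds: int          # number of code-writing attempts used
    code: str            # last candidate produced
    feedback: List[str]  # validation errors fed back along the way


def generate_with_repair(
    write_code: Callable[[Optional[str], Optional[str]], str],
    validate: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 5,
) -> GenerationResult:
    """Write code, validate it, and feed errors back until it passes.

    write_code(previous_code, error) returns a new candidate; on the first
    round both arguments are None. validate(code) returns (passed, message).
    """
    code: Optional[str] = None
    error: Optional[str] = None
    history: List[str] = []
    for round_no in range(1, max_rounds + 1):
        code = write_code(code, error)
        passed, message = validate(code)
        if passed:
            return GenerationResult(True, round_no, code, history)
        error = message          # becomes the feedback for the next attempt
        history.append(message)
    return GenerationResult(False, max_rounds, code or "", history)


# Stub "developer": the first attempt is wrong, the repair is correct.
def stub_developer(previous_code: Optional[str], error: Optional[str]) -> str:
    if error is None:
        return "def run(task): return None"
    return "def run(task): return task.upper()"


# Stub "validator": executes the candidate and checks one expected output.
def stub_validator(code: str) -> Tuple[bool, str]:
    namespace: dict = {}
    exec(code, namespace)
    ok = namespace["run"]("ping") == "PING"
    return ok, ("" if ok else "run('ping') did not return 'PING'")


result = generate_with_repair(stub_developer, stub_validator)
print(result.success, result.rounds)  # prints: True 2
```

In a setup like this, the number of rounds (or the tokens and dollars spent across them) is one plausible way to quantify generation effort, while holding the developer fixed across frameworks keeps that measurement comparable.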
