The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced empirical understanding of how framework choice affects agent performance. We propose \textbf{LLM-as-a-Developer}, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from its documentation, writes agent code, and iteratively repairs that code through a validate-and-feedback loop until tests pass. Because the developer is held constant and only the framework varies, generation effort becomes a quantitative proxy for API usability, and the resulting agents provide a controlled measure of framework effectiveness. We implement this methodology in \textbf{ADK Arena}, a fully automated pipeline with per-framework Docker isolation, three-level validation, and benchmark adapters for SWE-bench, $\tau^2$-bench, Terminal-Bench, and MCP-Atlas. Evaluating all 51 popular Python ADK frameworks (204 agent--benchmark pairs), we find that: (1)~generation succeeds for 57\% of runs, and its cost varies 5.6$\times$ across frameworks (\$0.6 to \$3.4 per agent), which we treat as a proxy for API complexity, though cost alone does not predict success; (2)~no single framework dominates: the best single-benchmark ADK agents resolve up to 80\% of tasks and can even \emph{beat} general-purpose frontier coding agents at a fraction of the cost, yet the median framework resolves only 32\%; (3)~across information-source ablations, genuine framework usage stays within a narrow 28--40\% band (highest with raw source access and still 33\% with no reference material at all), indicating that documentation, source code, and parametric knowledge are largely substitutable and that none of them is a hard bottleneck.
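The validate-and-feedback loop described above can be illustrated with a minimal sketch. This is not the ADK Arena implementation: the names `generate_with_repair`, `GenerationResult`, `generate`, `validate`, and `max_attempts` are hypothetical, the LLM call is replaced by a stub, and a single validation callback stands in for the three validation levels.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class GenerationResult:
    code: Optional[str]  # validated agent code, or None if no attempt passed
    attempts: int        # number of generate/validate rounds used
    feedback_log: List[str] = field(default_factory=list)


def generate_with_repair(
    generate: Callable[[str, List[str]], str],
    validate: Callable[[str], Tuple[bool, str]],
    docs: str,
    max_attempts: int = 5,  # assumed budget; the abstract does not state one
) -> GenerationResult:
    """Generate code, validate it, and feed failures back until it passes."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        code = generate(docs, feedback)   # an LLM call in a real pipeline
        ok, message = validate(code)      # e.g. build, import, and test checks
        if ok:
            return GenerationResult(code, attempt, feedback)
        feedback.append(message)          # the next round sees this failure
    return GenerationResult(None, max_attempts, feedback)


# Stub "developer": produces broken code first, then repairs it after feedback.
def stub_generate(docs: str, feedback: List[str]) -> str:
    return "def run(): return 'ok'" if feedback else "def run(): return"


# Stub validator: accepts only code that returns the expected literal.
def stub_validate(code: str) -> Tuple[bool, str]:
    return ("'ok'" in code, "run() must return 'ok'")


result = generate_with_repair(stub_generate, stub_validate, docs="framework docs")
print(result.attempts, result.code is not None)
```

In this sketch the number of attempts plays the role of generation effort: with the developer (here, the stub) held fixed, a framework that needs more repair rounds would register as harder to use.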