Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system, a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the system; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual system components makes the end-to-end score difficult to iterate on.