We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
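To make two of the metrics above concrete, the following is a minimal sketch under assumptions that the abstract does not confirm. It treats routing fidelity-at-k as the fraction of delegations whose chosen peer is among the k highest-scoring peers for that task, and the counterfactual ceiling as the quality an oracle would obtain by always delegating to the best peer. The names `Delegation`, `fidelity_at_k`, `counterfactual_headroom`, and `peer_quality` are hypothetical, and the toy numbers are illustrative only, not results from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Delegation:
    """One delegation decision (hypothetical record layout)."""
    task_id: str
    chosen_peer: str
    achieved_quality: float  # quality actually obtained on this task, in [0, 1]


def fidelity_at_k(delegations, peer_quality, k=1):
    """Fraction of delegations whose chosen peer is among the k best peers.

    Assumed definition: peers are ranked per task by a reference quality
    score; the benchmark's own ranking criterion may differ.
    """
    if not delegations:
        return 0.0
    hits = 0
    for d in delegations:
        scores = peer_quality[d.task_id]
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += d.chosen_peer in top_k
    return hits / len(delegations)


def counterfactual_headroom(delegations, peer_quality):
    """Mean gap, in percentage points, between the best peer's quality and
    the quality actually achieved (assumed oracle-style ceiling)."""
    gaps = [
        max(peer_quality[d.task_id].values()) - d.achieved_quality
        for d in delegations
    ]
    return 100.0 * sum(gaps) / len(gaps)


# Toy data, invented for illustration: per-task quality of each peer.
peer_quality = {
    "t1": {"a": 0.9, "b": 0.4, "c": 0.1},
    "t2": {"a": 0.2, "b": 0.8, "c": 0.5},
}
delegations = [
    Delegation("t1", "a", 0.9),  # routed to the best peer
    Delegation("t2", "c", 0.5),  # routed to the second-best peer
]

f1 = fidelity_at_k(delegations, peer_quality, k=1)
f2 = fidelity_at_k(delegations, peer_quality, k=2)
headroom = counterfactual_headroom(delegations, peer_quality)
```

In this toy case one of the two delegations hits the top-ranked peer, so fidelity-at-1 is one half while fidelity-at-2 is one, and the unrealized headroom comes entirely from the second task. This mirrors, in miniature, how routing fidelity and headroom can be reported separately from mean quality.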