DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
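
To make two of the metrics named above concrete, the following is a minimal sketch of one plausible reading of routing fidelity-at-k and the counterfactual-delegation ceiling. The abstract does not define either metric, so everything here is an assumption for illustration: the record fields (`peer_scores`, `delegated_to`, `achieved`), the function names, the choice to compute fidelity over delegated tasks only, and the choice to define the ceiling as the best per-task peer score. It is not the benchmark's released analysis pipeline.

```python
from typing import Dict, List, Optional, TypedDict


class TaskRecord(TypedDict):
    # Hypothetical schema, not the benchmark's actual archive format.
    peer_scores: Dict[str, float]   # quality each peer would reach on this task
    delegated_to: Optional[str]     # peer actually called, or None if no delegation
    achieved: float                 # quality the orchestrated run actually obtained


def fidelity_at_k(records: List[TaskRecord], k: int = 1) -> float:
    """Share of delegated tasks whose chosen peer is among the k best peers.

    Assumption: tasks without delegation are excluded from the denominator.
    """
    delegated = [r for r in records if r["delegated_to"] is not None]
    if not delegated:
        return 0.0
    hits = 0
    for r in delegated:
        scores = r["peer_scores"]
        # Rank peers from best to worst by their per-task score.
        ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
        if r["delegated_to"] in ranked[:k]:
            hits += 1
    return hits / len(delegated)


def ceiling_gap(records: List[TaskRecord]) -> float:
    """Mean gap between the best available peer score and achieved quality.

    Assumption: "perfect delegation" means always picking the best peer per task.
    """
    if not records:
        return 0.0
    gaps = [max(r["peer_scores"].values()) - r["achieved"] for r in records]
    return sum(gaps) / len(gaps)
```

Under this reading, two conditions can share nearly the same mean `achieved` quality while differing sharply in `fidelity_at_k`, which is the pattern finding (ii) reports, and `ceiling_gap` corresponds to the headroom described in finding (iii).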
