Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single-agent workflow. This paper evaluates whether coordination strategy should be selected dynamically by problem class rather than fixed globally. We run a frozen matrix of 30 enterprise tasks, spanning six industries and five problem classes, crossed with four execution conditions, three replications per cell, and four model arms: qwen_local, sonnet, gemma_openrouter, and an auxiliary openai cloud-validation arm. All 1,440 generated outputs (30 × 4 × 3 × 4) are judged by a fixed Sonnet rubric. The main finding is bounded and operationally useful, but it falls short of the original strict H1. The pre-registered exact-winner/CI criterion is not supported: exact winner identity is unstable across model arms, and several predicted strategies are close to, but not above, the best observed alternative. A weaker near-best routing claim is strongly supported: in every pre-registered model arm and problem class, and again in the auxiliary OpenAI validation arm, the predicted strategy is within 0.10 quality-score points of the best observed condition. Structured compliance verification is the clearest exception to the original mapping: all arms favor single_agent rather than consensus. A pre-registered Kendall's W test finds no reliable difference between Vietnamese-domain and English-domain tasks in how consistently the four coordination conditions are ranked (mean W of 0.20 in both strata; signed-rank p = .85), so H2 is not supported. We conclude that enterprise coordination policy should use dynamic routing as a calibrated default, not as a deterministic winner-selection law.
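The difference between the strict exact-winner criterion and the weaker near-best criterion can be illustrated with a minimal Python sketch. Only the 0.10 margin and the condition labels come from the abstract; the function names and the example scores are hypothetical and are not results from the study. The sketch also assumes that the four execution conditions are single_agent, consensus, debate, and synthesis, and it omits the confidence-interval part of the pre-registered criterion.

```python
# Margin stated in the abstract: predicted strategy must be within
# 0.10 quality-score points of the best observed condition.
NEAR_BEST_MARGIN = 0.10


def is_exact_winner(mean_scores: dict[str, float], predicted: str) -> bool:
    """Strict check: the predicted condition has the highest mean score.

    Simplification (assumption): no confidence-interval test is applied.
    """
    return mean_scores[predicted] == max(mean_scores.values())


def is_near_best(mean_scores: dict[str, float], predicted: str,
                 margin: float = NEAR_BEST_MARGIN) -> bool:
    """Weaker check: the predicted condition is within `margin` of the best."""
    best = max(mean_scores.values())
    return best - mean_scores[predicted] <= margin


# Hypothetical mean quality scores for one (model arm, problem class) cell.
# These numbers are illustrative only.
example_cell = {
    "single_agent": 0.82,
    "consensus": 0.78,
    "debate": 0.70,
    "synthesis": 0.80,
}

if __name__ == "__main__":
    for condition in example_cell:
        print(condition,
              "exact_winner =", is_exact_winner(example_cell, condition),
              "near_best =", is_near_best(example_cell, condition))
```

In this illustrative cell, a predicted strategy of consensus would fail the exact-winner check but pass the near-best check, which is the pattern the abstract reports for several predicted strategies.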