Recent advances in large language models (LLMs) have made multi-agent architectures attractive for challenging reasoning tasks. However, many existing systems rely on stochastic routing or ad hoc heuristics, making their behavior difficult to reproduce and their decision processes hard to interpret. We propose ORCH, a deterministic coordination framework for discrete-choice reasoning that orchestrates heterogeneous LLMs. ORCH follows a ``many analyses, one decision'' paradigm: multiple base models independently produce structured analyses, and a dedicated merge agent outputs the final choice. The framework uses fixed rules for task decomposition and answer aggregation, keeping the pipeline predictable, reproducible, and training-free. Determinism here refers to fixed routing and aggregation rules under a fixed evaluation protocol, rather than strict bit-level reproducibility across deployments. To exploit model complementarity, we optionally introduce a router guided by exponential moving averages (EMA) that updates agent selection based on historical accuracy, latency, or cost; because it relies on answer-based feedback, it is intended mainly for benchmarking, controlled evaluation, or delayed-feedback settings. Experiments on MMLU, MMLU-Pro, and GSM8K show that ORCH consistently outperforms single-model baselines and a majority-vote ensemble. On MMLU-Pro, ORCH improves accuracy by more than 10 points over the strongest baseline, and on GSM8K it yields gains exceeding 50 points; McNemar tests confirm statistical significance. The EMA router provides an additional 0.7--2.0-point accuracy gain, and ablations show that both multi-agent collaboration and routing contribute substantially. Overall, ORCH offers a practical path toward controllable, interpretable, and deployment-ready LLM-based agent systems for discrete-choice reasoning.