SPOQ: Specialist Orchestrated Queuing for Multi-Agent Software Engineering

Multi-agent AI systems show promise for automating software engineering tasks, yet existing approaches suffer from coordination overhead, quality control gaps, and limited human oversight. We introduce SPOQ (Specialist Orchestrated Queuing), a methodology combining three innovations: (1) wave-based topological dispatch, which computes parallel execution waves from task dependency graphs; (2) dual validation gates, which apply quality metrics before execution (planning validation) and after it (code validation) to reduce rework cycles; and (3) Human-as-an-Agent (HaaA) integration, in which a human specialist participates in decomposition and can be consulted during execution. SPOQ uses a three-tier agent hierarchy (Opus workers, Sonnet reviewers, Haiku investigators) to optimize cost-quality tradeoffs. We evaluate SPOQ through four experiments. Experiment 1: wave dispatch approaches the critical-path lower bound (ratio 1.03 to 1.11, speedup up to 14.3x); on a 2-slot local backend it delivers a stable 1.4x speedup. Experiment 2: SPOQ improves planning coverage from 93.0 to 99.75, eliminates cyclic plans, and lifts parallelism from 31.0 to 75.25. Experiment 3: dual validation reduces defects from 0.34 to 0.20 per task and lifts the test pass rate from 91.25% to 99.75%. Experiment 4: human review reduces residual defects from 0.47 to 0.03 per task. The results are replicated on a locally hosted open-weights model (Qwen3.6-35B-A3B), indicating that the gains are attributable to orchestration rather than to any specific model. A longitudinal study across 17 repositories, 8,589 commits, 1,822 tasks, and 13,866 tests (99.87% pass rate) provides ecological validation.
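To make the idea of wave-based topological dispatch concrete, the following is a minimal sketch, not SPOQ's actual implementation. It assumes a task graph given as a mapping from each task to the set of tasks it depends on, and it uses a layered form of Kahn's topological sort: every task whose prerequisites are all complete joins the next wave, and tasks within one wave can run in parallel. The function name `compute_waves` and the example task names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def compute_waves(deps):
    """Group tasks into parallel waves (illustrative sketch only).

    deps maps each task name to the set of tasks it depends on.
    Returns a list of waves; each wave is a sorted list of task names.
    Raises ValueError if the graph contains a cycle.
    """
    # Collect every task, including ones that appear only as prerequisites.
    tasks = set(deps)
    for prereqs in deps.values():
        tasks |= set(prereqs)

    indegree = {t: 0 for t in tasks}   # unmet prerequisites per task
    dependents = defaultdict(list)     # prerequisite -> tasks waiting on it
    for task, prereqs in deps.items():
        for p in set(prereqs):
            indegree[task] += 1
            dependents[p].append(task)

    # The first wave holds every task with no prerequisites.
    wave = sorted(t for t in tasks if indegree[t] == 0)
    waves = []
    scheduled = 0
    while wave:
        waves.append(wave)
        scheduled += len(wave)
        next_wave = []
        for t in wave:
            for d in dependents[t]:
                indegree[d] -= 1
                if indegree[d] == 0:   # all prerequisites now satisfied
                    next_wave.append(d)
        wave = sorted(next_wave)

    # Tasks on a cycle never reach indegree 0, so they are never scheduled.
    if scheduled != len(tasks):
        raise ValueError("dependency graph contains a cycle")
    return waves


# Hypothetical four-task plan: two tasks depend on a shared schema task.
example = {
    "schema": set(),
    "api": {"schema"},
    "ui": {"schema"},
    "tests": {"api", "ui"},
}
waves = compute_waves(example)
print(waves)  # [['schema'], ['api', 'ui'], ['tests']]
```

In this sketch, the number of waves equals the length of the longest dependency chain, so with equal task durations and enough workers the wave count matches the critical-path bound that Experiment 1 compares against. The cycle check also gives a simple way to reject cyclic plans before dispatch; whether SPOQ's planning validation works this way is an assumption.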
