Multi-agent AI systems show promise for automating software engineering tasks, yet existing approaches suffer from coordination overhead, quality control gaps, and limited human oversight. We introduce SPOQ (Specialist Orchestrated Queuing), a methodology combining three innovations: (1) wave-based topological dispatch, which computes parallel execution waves from task dependency graphs; (2) dual validation gates, which apply quality metrics before execution (planning validation) and after execution (code validation) to reduce rework cycles; and (3) Human-as-an-Agent (HaaA) integration, in which a human specialist participates in decomposition and can be consulted during execution. SPOQ uses a three-tier agent hierarchy (Opus workers, Sonnet reviewers, Haiku investigators) to optimize cost-quality tradeoffs. We evaluate SPOQ through four experiments. Experiment 1: wave dispatch approaches the critical-path lower bound (ratio 1.03--1.11, speedup up to 14.3x); on a 2-slot local backend it delivers a stable 1.4x speedup. Experiment 2: SPOQ improves planning coverage from 93.0 to 99.75, eliminates cyclic plans, and lifts parallelism from 31.0 to 75.25. Experiment 3: dual validation reduces defects from 0.34 to 0.20 per task and lifts the test pass rate from 91.25% to 99.75%. Experiment 4: human review reduces residual defects from 0.47 to 0.03 per task. The results are replicated on a locally hosted open-weights model (Qwen3.6-35B-A3B), indicating that the gains are attributable to orchestration rather than to any specific model. A longitudinal study across 17 repositories, 8,589 commits, 1,822 tasks, and 13,866 tests (99.87% pass rate) provides ecological validation.
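To illustrate what computing parallel execution waves from a task dependency graph can look like, the following is a minimal sketch using Kahn-style layered topological sorting. It is not SPOQ's implementation: the function name `compute_waves`, the dictionary-based plan format, and the example task names are all assumptions made for illustration. Each wave contains tasks whose prerequisites all lie in earlier waves, so tasks within a wave could be dispatched in parallel, and a plan containing a cycle is rejected.

```python
from collections import defaultdict


def compute_waves(deps):
    """Group tasks into waves (hypothetical helper, not SPOQ's API).

    deps maps each task name to an iterable of its prerequisite task names.
    Returns a list of waves; each wave is a sorted list of task names whose
    prerequisites all appear in earlier waves.
    Raises ValueError if the dependency graph contains a cycle.
    """
    # Collect every task, including ones that appear only as prerequisites.
    tasks = set(deps)
    for prereqs in deps.values():
        tasks.update(prereqs)

    # remaining[t] = prerequisites of t that have not been scheduled yet.
    remaining = {t: set(deps.get(t, ())) for t in tasks}

    # dependents[p] = tasks that list p as a prerequisite.
    dependents = defaultdict(set)
    for task, prereqs in remaining.items():
        for p in prereqs:
            dependents[p].add(task)

    # The first wave holds all tasks with no prerequisites.
    wave = sorted(t for t, prereqs in remaining.items() if not prereqs)
    waves = []
    scheduled = 0
    while wave:
        waves.append(wave)
        scheduled += len(wave)
        next_wave = []
        for task in wave:
            for d in dependents[task]:
                remaining[d].discard(task)
                if not remaining[d]:
                    # d's last outstanding prerequisite was just scheduled.
                    next_wave.append(d)
        wave = sorted(next_wave)

    # Tasks on a cycle never reach zero outstanding prerequisites.
    if scheduled != len(tasks):
        raise ValueError("cyclic plan: some tasks can never be scheduled")
    return waves


if __name__ == "__main__":
    # Hypothetical plan: two tasks depend on "schema", one depends on both.
    plan = {
        "schema": [],
        "api": ["schema"],
        "ui": ["schema"],
        "tests": ["api", "ui"],
    }
    for i, w in enumerate(compute_waves(plan), start=1):
        print(f"wave {i}: {w}")
```

In this sketch the number of waves equals the number of tasks on the longest dependency chain, which is the sense in which a wave schedule can be compared against a critical-path lower bound when every task takes similar time and enough execution slots are available.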