Understanding and Bridging the Planner-Coder Gap: A Systematic Study on the Robustness of Multi-Agent Systems for Code Generation

Multi-agent systems (MASs) have emerged as a promising paradigm for automated code generation, demonstrating impressive performance on established benchmarks. Despite their rapid development, the mechanisms underlying their robustness remain poorly understood, which raises critical concerns for real-world deployment. This paper presents a systematic empirical study that uncovers the internal robustness flaws of MASs using a mutation-based methodology. Using a testing pipeline that combines semantic-preserving mutation operators with a novel fitness function, we assess mainstream MASs across multiple datasets and LLMs. Our findings reveal substantial robustness flaws: semantically equivalent inputs cause drastic performance drops, with MASs failing to solve 7.9\%--83.3\% of the problems they initially solved. Through comprehensive failure analysis, we identify a fundamental cause of these robustness issues: the \textit{planner-coder gap}, which accounts for 75.3\% of failures. This gap arises from information loss in the multi-stage transformation process: planning agents decompose requirements into underspecified plans, and coding agents subsequently misinterpret intricate logic during code generation. Based on this formulation of the information transformation process, we propose a \textit{repairing method} that mitigates information loss through multi-prompt generation and introduces a monitor agent to bridge the planner-coder gap. Evaluation shows that our repairing method effectively enhances the robustness of MASs, resolving 40.0\%--88.9\% of the identified failures. Our work uncovers critical robustness flaws in MASs and provides effective mitigation strategies, contributing essential insights for developing more reliable MASs for code generation.
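
To make the reported failure percentage concrete, the following is a minimal sketch of how such a rate could be computed: among the problems a system solves in their original form, count those that fail on at least one semantically equivalent mutant. This is an illustration under stated assumptions, not the paper's pipeline. The names `robustness_failure_rate`, `toy_mutate`, and `toy_solves` are hypothetical, the toy mutation (a word substitution) merely stands in for the semantic-preserving operators, and the fitness function is not modeled.

```python
from typing import Callable, Iterable, List


def robustness_failure_rate(
    problems: Iterable[str],
    mutate: Callable[[str], List[str]],
    solves: Callable[[str], bool],
) -> float:
    """Fraction of initially solved problems that fail on at least one mutant."""
    # Keep only the problems the system solves in their original wording.
    solved = [p for p in problems if solves(p)]
    if not solved:
        return 0.0
    # A problem counts as a robustness failure if any mutant is not solved.
    broken = sum(1 for p in solved if any(not solves(m) for m in mutate(p)))
    return broken / len(solved)


# Hypothetical stand-ins: a single rewording operator and a stubbed system.
def toy_mutate(problem: str) -> List[str]:
    return [problem.replace("return", "output")]


def toy_solves(problem: str) -> bool:
    # The stub never solves one problem and breaks on one reworded prompt.
    return problem != "reverse a string" and "output the max" not in problem


problems = [
    "return the sum of a list",
    "return the max of a list",
    "sort a list",
    "reverse a string",
]
rate = robustness_failure_rate(problems, toy_mutate, toy_solves)
print(f"{rate:.1%} of initially solved problems fail after mutation")
```

In this toy setup three of the four problems are solved initially, and one of those three fails once reworded. A real study would replace the stub with calls to the MAS under test and check generated code against the benchmark's test cases.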
