Multi-agent systems (MASs) have emerged as a promising paradigm for automated code generation, demonstrating impressive performance on established benchmarks by decomposing complex coding tasks across specialized agents with different roles. Despite their rapid development and adoption, their robustness remains largely under-explored, raising critical concerns for real-world deployment. This paper presents the first comprehensive study of the robustness of MASs for code generation, using a fuzzing-based testing approach. We design a fuzzing pipeline that incorporates semantic-preserving mutation operators and a novel fitness function, and use it to assess mainstream MASs across multiple datasets and LLMs. Our findings reveal substantial robustness flaws in various popular MASs: after semantic-preserving mutations are applied, they fail to solve 7.9%-83.3% of the problems they initially solved. Through comprehensive failure analysis, we identify a common yet largely overlooked cause of the robustness issue: miscommunication between planning and coding agents, where plans lack sufficient detail and coding agents misinterpret intricate logic, consistent with the challenges inherent in a multi-stage information transformation process. Accordingly, we propose a repair method that combines multi-prompt generation with a new monitor agent to address this issue. Evaluation shows that our repair method effectively enhances the robustness of MASs, resolving 40.0%-88.9% of the identified failures. Our work uncovers critical robustness flaws in MASs and provides effective mitigation strategies, contributing essential insights for developing more reliable MASs for code generation.
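As a rough illustration of the robustness measure described above (the share of initially solved problems that fail after semantic-preserving mutation), the following minimal sketch may help. Everything in it is an assumption made for illustration: the two mutation operators, the `solves` callback standing in for a full MAS run plus test execution, and the name `fuzz_failure_rate` are hypothetical, and the fitness function from the abstract is not specified here, so the sketch simply picks operators at random instead.

```python
import random
from typing import Callable, List

Mutator = Callable[[str], str]


def collapse_whitespace(prompt: str) -> str:
    # Hypothetical operator: normalizing whitespace should not change meaning.
    return " ".join(prompt.split())


def rephrase_return(prompt: str) -> str:
    # Hypothetical operator: swap one instruction word for a near-synonym.
    return prompt.replace("Return", "Output")


def fuzz_failure_rate(
    problems: List[str],
    solves: Callable[[str], bool],
    mutators: List[Mutator],
    budget: int = 5,
    seed: int = 0,
) -> float:
    """Fraction of initially solved problems that fail under some mutant.

    `solves` is an assumed stand-in for running the system on a prompt and
    checking the generated code against the problem's tests.
    """
    rng = random.Random(seed)
    # Only problems solved before mutation count toward the denominator.
    solved = [p for p in problems if solves(p)]
    if not solved:
        return 0.0
    broken = 0
    for prompt in solved:
        mutant = prompt
        for _ in range(budget):
            # Random operator choice; a fitness-guided choice would go here.
            mutant = rng.choice(mutators)(mutant)
            if not solves(mutant):
                broken += 1
                break
    return broken / len(solved)
```

With a deliberately brittle stub such as `lambda p: "Return" in p`, the `rephrase_return` operator alone breaks every initially solved problem, while `collapse_whitespace` alone breaks none, which is the kind of contrast the reported failure rates summarize.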