Multi-Agent Code Generation Systems (MACGS) have achieved remarkable success, but their inherently complex multi-agent architectures produce large volumes of intermediate outputs. How much each of these intermediate outputs contributes to system correctness remains opaque, which impedes targeted optimization of MACGS designs. To address this challenge, we propose CAM, the first \textbf{C}ausality-based \textbf{A}nalysis framework for \textbf{M}ACGS, which systematically quantifies the contribution of different intermediate features to system correctness. By comprehensively categorizing intermediate outputs and systematically simulating realistic errors on intermediate features, we identify the features that matter for system correctness and aggregate their importance rankings. An extensive empirical analysis of these rankings yields two main findings. First, we uncover context-dependent features, whose importance emerges mainly through interactions with other features, which indicates that quality assurance for MACGS should incorporate cross-feature consistency checks. Second, a hybrid-backend MACGS that assigns different backend LLMs according to their relative strengths achieves up to a 7.2\% Pass@1 improvement, underscoring hybrid architectures as a promising direction for future MACGS design. We further demonstrate CAM's practical utility through two applications: (1) failure repair, which achieves a 73.3\% success rate by optimizing the top-3 importance-ranked features, and (2) feature pruning, which reduces intermediate token consumption by up to 66.8\% while maintaining generation performance. Our work provides actionable insights for MACGS design and deployment and establishes causality analysis as a powerful approach for understanding and improving MACGS.