The rapid emergence of multi-agent AI systems (MAS) such as LangChain, CrewAI, and AutoGen has shaped how large language model (LLM) applications are developed and orchestrated. However, little is known about how these systems evolve and are maintained in practice. This paper presents the first large-scale empirical study of open-source MAS, analyzing more than 42K unique commits and more than 4.7K resolved issues across eight leading systems. Our analysis identifies three distinct development profiles: sustained, steady, and burst-driven. These profiles reflect substantial variation in ecosystem maturity. Perfective commits constitute 40.8% of all changes, indicating that feature enhancement is prioritized over corrective maintenance (27.4%) and adaptive updates (24.3%). Issue data show that the most frequent concerns involve bugs (22%), infrastructure (14%), and agent coordination challenges (10%). Issue reporting also increased sharply across all frameworks starting in 2023. Median resolution times range from under one day to about two weeks, with distributions skewed toward fast responses but a minority of issues requiring extended attention. These results highlight both the momentum and the fragility of the current ecosystem, underscoring the need for improved testing infrastructure, documentation quality, and maintenance practices to ensure long-term reliability and sustainability.
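
To make the maintenance taxonomy above concrete, the sketch below illustrates how commits might be grouped into perfective, corrective, and adaptive categories with simple keyword heuristics, and how a median issue resolution time could be computed. The keyword lists, function names, and data shapes are illustrative assumptions for this sketch, not the study's actual classification protocol.

```python
# Minimal sketch (assumed heuristics, not the paper's method): keyword-based
# commit classification into maintenance categories and a median
# resolution-time calculation for issues.
from datetime import datetime
from statistics import median

# Hypothetical keyword heuristics per maintenance category.
CATEGORY_KEYWORDS = {
    "corrective": ["fix", "bug", "patch", "error", "crash"],
    "adaptive":   ["upgrade", "bump", "migrate", "compat", "deprecat"],
    "perfective": ["add", "improve", "refactor", "feature", "enhance"],
}

def classify_commit(message: str) -> str:
    """Assign a commit message to the first category whose keywords match."""
    lowered = message.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return category
    return "other"

def median_resolution_days(issues: list[tuple[str, str]]) -> float:
    """Median number of days between ISO-formatted opened/closed timestamps."""
    durations = [
        (datetime.fromisoformat(closed) - datetime.fromisoformat(opened)).days
        for opened, closed in issues
    ]
    return median(durations)

if __name__ == "__main__":
    print(classify_commit("fix: crash when agents share a tool"))  # corrective
    print(median_resolution_days([("2023-05-01", "2023-05-02"),
                                  ("2023-05-01", "2023-05-15"),
                                  ("2023-06-01", "2023-06-03")]))  # 2
```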