Evidence on AI in software engineering still leans heavily toward individual task completion, while evidence on team-level delivery remains scarce. We report a retrospective longitudinal field study of Chiron, an industrial platform that coordinates humans and AI agents across four delivery stages: analysis, planning, implementation, and validation. The study covers three real software modernization programs -- a COBOL banking migration (~30k LOC), a large accounting modernization (~400k LOC), and a .NET/Angular mortgage modernization (~30k LOC) -- observed across five delivery configurations: a traditional baseline and four successive platform versions (V1--V4). The benchmark separates observed outcomes (stage durations, task volumes, validation-stage issues, first-release coverage) from modeled outcomes (person-days and senior-equivalent effort under explicit staffing scenarios). Under baseline staffing assumptions, portfolio duration falls from 36.0 to 9.3 summed project-weeks; modeled raw effort falls from 1080.0 to 232.5 person-days; modeled senior-equivalent effort (SEE) falls from 1080.0 to 139.5 SEE-days; validation-stage issue load falls from 8.03 to 2.09 issues per 100 tasks; and first-release coverage rises from 77.0% to 90.5%. V3 and V4 add acceptance-criteria validation, repository-native review, and hybrid human-agent execution, and they improve speed and coverage while reducing issue load. The evidence supports a central thesis: the largest gains appear when AI is embedded in an orchestrated workflow rather than deployed as an isolated coding assistant.
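To make the size of the reported changes easier to compare, the short sketch below derives relative reductions from the before/after figures quoted above. The variable and function names are illustrative only, and the percentages are simple arithmetic on the abstract's numbers rather than results stated by the study.

```python
# Before/after pairs copied from the figures quoted in the abstract.
# The dictionary keys and function name are illustrative, not from the study.
REPORTED = {
    "summed project-weeks": (36.0, 9.3),
    "modeled raw effort (person-days)": (1080.0, 232.5),
    "modeled senior-equivalent effort (SEE-days)": (1080.0, 139.5),
    "validation-stage issues per 100 tasks": (8.03, 2.09),
}


def relative_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return (before - after) / before * 100.0


# Relative reductions, rounded to one decimal place.
reductions = {
    name: round(relative_reduction(before, after), 1)
    for name, (before, after) in REPORTED.items()
}

# Coverage is already a percentage, so report the change in percentage points.
coverage_gain_points = round(90.5 - 77.0, 1)

for name, pct in reductions.items():
    print(f"{name}: -{pct}%")
print(f"first-release coverage: +{coverage_gain_points} percentage points")
```

Read this way, duration, raw effort, and issue load each fall by roughly three quarters, while the modeled senior-equivalent effort falls further because it also depends on the assumed staffing mix.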