Orchestrating Human-AI Software Delivery: A Retrospective Longitudinal Field Study of Three Software Modernization Programs

Evidence on AI in software engineering still leans heavily toward individual task completion, while evidence on team-level delivery remains scarce. We report a retrospective longitudinal field study of Chiron, an industrial platform that coordinates humans and AI agents across four delivery stages: analysis, planning, implementation, and validation. The study covers three real software modernization programs -- a COBOL banking migration (~30k LOC), a large accounting modernization (~400k LOC), and a .NET/Angular mortgage modernization (~30k LOC) -- observed across five delivery configurations: a traditional baseline and four successive platform versions (V1--V4). The benchmark separates observed outcomes (stage durations, task volumes, validation-stage issues, first-release coverage) from modeled outcomes (person-days and senior-equivalent effort under explicit staffing scenarios). Under baseline staffing assumptions, portfolio delivery time falls from 36.0 to 9.3 summed project-weeks; modeled raw effort falls from 1080.0 to 232.5 person-days; modeled senior-equivalent effort falls from 1080.0 to 139.5 SEE-days; validation-stage issue load falls from 8.03 to 2.09 issues per 100 tasks; and first-release coverage rises from 77.0% to 90.5%. V3 and V4 add acceptance-criteria validation, repository-native review, and hybrid human-agent execution, improving speed and coverage while lowering issue load. The evidence supports a central thesis: the largest gains appear when AI is embedded in an orchestrated workflow rather than deployed as an isolated coding assistant.
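To make the headline figures easier to compare, the following minimal Python sketch converts each reported pair of values into a relative change. It uses only the numbers quoted in the abstract; the names `REPORTED`, `COVERAGE`, `relative_reduction`, and `summarize` are illustrative and not part of the study or of Chiron, and the sketch assumes each pair is the first and last value reported for the portfolio.

```python
# Figures quoted in the abstract as (first value, last value).
# Assumption: each pair is the start and end point of the reported portfolio trend.
REPORTED = {
    "summed project-weeks": (36.0, 9.3),
    "modeled raw effort (person-days)": (1080.0, 232.5),
    "modeled senior-equivalent effort (SEE-days)": (1080.0, 139.5),
    "validation-stage issues per 100 tasks": (8.03, 2.09),
}

# First-release coverage is already a percentage, so report a point difference.
COVERAGE = (77.0, 90.5)


def relative_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


def summarize() -> dict:
    """Map each metric name to its relative reduction, rounded to one decimal."""
    rows = {
        name: round(relative_reduction(before, after), 1)
        for name, (before, after) in REPORTED.items()
    }
    rows["first-release coverage (points gained)"] = round(COVERAGE[1] - COVERAGE[0], 1)
    return rows


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Under these assumptions the relative reductions are roughly 74% for summed project-weeks, 78% for modeled raw effort, 87% for modeled senior-equivalent effort, and 74% for validation-stage issue load, with first-release coverage gaining 13.5 percentage points. The effort figures are modeled outcomes, so their reductions depend on the staffing scenario.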
