Recovering an accurate architecture from large-scale legacy software is hindered by architectural drift, missing relations, and the limited context windows of large language models (LLMs). We present ArchAgent, a scalable agent-based framework that combines static analysis, adaptive code segmentation, and LLM-powered synthesis to reconstruct multiview, business-aligned architectures from cross-repository codebases. ArchAgent introduces scalable diagram generation with contextual pruning and integrates cross-repository data to identify business-critical modules. Evaluations on typical large-scale GitHub projects show significant improvements over existing benchmarks. An ablation study confirms that dependency context improves the accuracy of the architectures generated for production-level repositories, and a real-world case study demonstrates effective recovery of critical business logic from legacy projects. The dataset is available at https://github.com/panrusheng/arch-eval-benchmark.
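To make the idea of dependency-aware segmentation under a limited LLM context concrete, the following is a minimal sketch and not ArchAgent's actual algorithm. It assumes that static analysis has already produced a per-module token estimate and an import graph. The function name `segment_modules`, the module names, the token counts, and the budget are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def segment_modules(sizes, deps, budget):
    """Greedily pack modules into segments whose total size fits `budget`.

    sizes:  dict of module name -> estimated token count (hypothetical values)
    deps:   dict of module name -> set of modules it imports
    budget: maximum tokens per segment (a stand-in for an LLM context limit)

    A module larger than the budget still gets a segment of its own.
    """
    # Treat dependencies as undirected so related modules are packed together.
    neighbors = defaultdict(set)
    for src, targets in deps.items():
        for dst in targets:
            if src in sizes and dst in sizes:
                neighbors[src].add(dst)
                neighbors[dst].add(src)

    unassigned = set(sizes)
    segments = []
    while unassigned:
        # Seed with the most connected remaining module (ties: alphabetical).
        seed = max(sorted(unassigned),
                   key=lambda m: len(neighbors[m] & unassigned))
        segment, used = [seed], sizes[seed]
        unassigned.remove(seed)

        # Grow the segment breadth-first along dependency edges.
        frontier = sorted(neighbors[seed] & unassigned)
        while frontier:
            candidate = frontier.pop(0)
            if candidate not in unassigned:
                continue
            if used + sizes[candidate] > budget:
                continue  # pruning step: leave out what does not fit
            segment.append(candidate)
            used += sizes[candidate]
            unassigned.remove(candidate)
            frontier.extend(sorted(neighbors[candidate] & unassigned))
        segments.append(segment)
    return segments


# Hypothetical input: five modules and their import relations.
sizes = {"api": 300, "auth": 200, "db": 400, "billing": 500, "utils": 100}
deps = {"api": {"auth", "billing"}, "auth": {"db"}, "billing": {"db"}}

for segment in segment_modules(sizes, deps, budget=1000):
    print(segment)
```

In this sketch each segment could then be passed to an LLM together with a short summary of its cross-segment dependencies. The greedy breadth-first growth is one simple design choice among many; a real system might instead weight edges by call frequency or use graph clustering.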