Architectural Decision Records (ADRs) play a central role in maintaining software architecture quality, yet many decision violations go unnoticed because projects lack both systematic documentation and automated detection mechanisms. Recent advances in Large Language Models (LLMs) open new possibilities for automating architectural reasoning at scale. We investigated how effectively LLMs can identify decision violations in open-source systems by examining their agreement, accuracy, and inherent limitations. Our study analyzed 980 ADRs across 109 GitHub repositories using a multi-model pipeline in which one primary LLM screens for potential decision violations and three additional LLMs independently validate its reasoning. We assessed agreement, accuracy, precision, and recall, and complemented the quantitative findings with expert evaluation. The models achieved substantial agreement and strong accuracy for explicit, code-inferable decisions, but accuracy fell short for implicit or deployment-oriented decisions that depend on configuration or organizational knowledge. LLMs can therefore meaningfully support validation of architectural decision compliance; however, they cannot yet replace human expertise for decisions that are not reflected directly in code.