Large language models (LLMs), an emerging information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field and have shown potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized, safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark features a five-level, cognition-oriented evaluation framework (i.e., Knowledge Memorization, Understanding, Reasoning, Calculation, and Application). Based on this framework, 23 representative evaluation tasks were defined. These tasks were derived from authentic AEC practice, ranging from code retrieval to specialized document generation. Subsequently, a 4,800-question dataset spanning diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an "LLM-as-a-Judge" approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses against expert-derived rubrics. Evaluation of nine LLMs revealed a clear performance decline across the five cognitive levels. Although the models demonstrated proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, they showed significant deficits in interpreting knowledge from tables in building codes, performing complex reasoning and calculations, and generating domain-specific documents. This study thus lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practice.
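To make the "LLM-as-a-Judge" protocol concrete, the sketch below shows one minimal way a judge model could score a long-form answer against an expert-derived rubric, as the abstract describes. The rubric entries, prompt wording, point scale, and the `call_llm` client are illustrative assumptions, not AECBench's actual implementation; the only element taken from the text is that a judge model scores complex responses using rubrics written by domain experts.

```python
import json
import re
from typing import Callable

# Hypothetical rubric for one AEC question: criterion text plus a maximum score.
# Real AECBench rubrics are expert-derived; these entries are placeholders.
RUBRIC = [
    {"criterion": "Cites the correct clause of the relevant building code", "max_score": 4},
    {"criterion": "Applies the required calculation correctly", "max_score": 4},
    {"criterion": "Presents the answer in the expected document format", "max_score": 2},
]

# Assumed judge prompt; doubled braces render as literal JSON braces in format().
JUDGE_PROMPT = """You are an expert AEC reviewer. Score the candidate answer
against each rubric criterion. Reply with JSON: {{"scores": [int, ...]}}.

Question:
{question}

Candidate answer:
{answer}

Rubric (one score per criterion, 0 to max_score):
{rubric}
"""

def judge(question: str, answer: str, call_llm: Callable[[str], str]) -> float:
    """Score one long-form answer with an LLM judge and a rubric.

    `call_llm` is a placeholder for whatever chat-completion client is used;
    it takes a prompt string and returns the judge model's text reply.
    """
    rubric_text = "\n".join(
        f"{i + 1}. {r['criterion']} (max {r['max_score']})"
        for i, r in enumerate(RUBRIC)
    )
    prompt = JUDGE_PROMPT.format(question=question, answer=answer, rubric=rubric_text)
    reply = call_llm(prompt)

    # Extract the JSON object from the judge's reply; score 0 if parsing fails.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return 0.0
    scores = json.loads(match.group())["scores"]
    # Clip each criterion score to its maximum, then normalize to 0..1.
    clipped = [min(max(s, 0), r["max_score"]) for s, r in zip(scores, RUBRIC)]
    return sum(clipped) / sum(r["max_score"] for r in RUBRIC)

if __name__ == "__main__":
    # Stub judge for demonstration: always awards full marks.
    print(judge("Q", "A", lambda p: '{"scores": [4, 4, 2]}'))  # 1.0
```

Clipping each criterion to its maximum and returning a normalized score keeps judge outputs comparable across tasks whose rubrics differ in length and point totals, which supports the scalable, consistent evaluation the abstract claims.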