Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark features a five-level, cognition-oriented evaluation framework (i.e., Knowledge Memorization, Understanding, Reasoning, Calculation, and Application). Based on this framework, 23 representative evaluation tasks were defined. These tasks were derived from authentic AEC practice, with scope ranging from code retrieval to specialized document generation. Subsequently, a 4,800-question dataset encompassing diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an "LLM-as-a-Judge" approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses, leveraging expert-derived rubrics. The evaluation of nine LLMs revealed a clear performance decline across the five cognitive levels. Despite demonstrating proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, the models showed significant performance deficits, particularly in interpreting knowledge from tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents. Consequently, this study lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practices.
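For illustration, the following is a minimal sketch of how an "LLM-as-a-Judge" scoring step over expert-derived rubrics might be implemented. The `judge_response` helper, prompt wording, rubric format, and model name are hypothetical assumptions for exposition, not the paper's actual evaluation pipeline.

```python
# A minimal, hypothetical sketch of an "LLM-as-a-Judge" scoring loop.
# The rubric format, prompt wording, and model name are illustrative
# assumptions, not AECBench's actual implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an expert AEC reviewer. Score the candidate answer
against each rubric criterion on a 0-5 scale and return one line per
criterion in the form "<criterion>: <score>".

Question: {question}
Rubric criteria:
{rubric}
Candidate answer:
{answer}
"""

def judge_response(question: str, rubric: list[str], answer: str,
                   model: str = "gpt-4o") -> dict[str, int]:
    """Ask a judge LLM to grade a long-form answer with an expert rubric."""
    prompt = JUDGE_PROMPT.format(
        question=question,
        rubric="\n".join(f"- {c}" for c in rubric),
        answer=answer,
    )
    reply = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic scoring for consistency
        messages=[{"role": "user", "content": prompt}],
    )
    scores: dict[str, int] = {}
    for line in reply.choices[0].message.content.splitlines():
        criterion, sep, value = line.partition(":")
        if sep:
            try:
                scores[criterion.strip().lstrip("- ")] = int(value.strip())
            except ValueError:
                pass  # skip lines that are not "<criterion>: <score>"
    return scores
```

In this sketch, grading each criterion separately and pinning the judge's temperature to zero are design choices aimed at the scalability and consistency the abstract attributes to the approach; per-criterion scores can then be aggregated into a task-level result.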