Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning

Automated building facade inspection is a critical component of urban resilience and smart city maintenance. Traditionally, this field has relied on specialized discriminative models (e.g., YOLO, Mask R-CNN) that excel at pixel-level localization but are limited to passive perception, generalize poorly, and lack the visual understanding needed to interpret structural topology. Large Multimodal Models (LMMs) promise a paradigm shift toward active reasoning, yet their application in such high-stakes engineering domains lacks rigorous evaluation standards. To bridge this gap, we introduce a human-in-the-loop, semi-automated annotation framework that uses expert verification of model proposals to unify 12 fragmented datasets under a standardized, hierarchical ontology. Building on this foundation, we present \textit{DefectBench}, the first multi-dimensional benchmark designed to probe LMMs beyond basic semantic recognition. \textit{DefectBench} evaluates 18 state-of-the-art (SOTA) LMMs across three escalating cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometry Segmentation. Extensive experiments reveal that while current LMMs demonstrate exceptional topological awareness and semantic understanding (effectively diagnosing "what" and "how"), they exhibit significant deficiencies in metric localization precision ("where"). Crucially, however, we validate the viability of zero-shot generative segmentation, showing that general-purpose foundation models can rival specialized supervised networks without domain-specific training. This work provides both a rigorous benchmarking standard and a high-quality open-source database, establishing a new baseline for the advancement of autonomous AI agents in civil engineering.
