Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often fail to capture the fine-grained attributes and domain rules the task requires. Large vision-language models (VLMs) excel at open-world recognition, but they frequently misinterpret complex facility states that must be judged against engineering standards, which makes them unreliable in real-world applications. To address this, we propose a domain-adapted framework that turns VLMs into specialized agents for intelligent infrastructure analysis. Our approach combines a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we apply open-vocabulary fine-tuning to Grounding DINO to localize diverse assets robustly with minimal supervision, then adapt Qwen-VL with LoRA for deep semantic attribute reasoning. To mitigate hallucinations and enforce compliance with professional standards, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, offering a robust solution for intelligent infrastructure monitoring.
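To make the dual-modality retrieval step concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes that standards and visual exemplars have already been embedded as vectors, and it ranks each store by cosine similarity before assembling a prompt for the VLM. The `Entry` type, the helper names, the toy vectors, and the placeholder labels are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    """One retrievable item: a standard clause or a labelled visual exemplar."""
    label: str
    vector: List[float]  # toy embedding; a real system would use a learned encoder


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], index: List[Entry], k: int) -> List[Entry]:
    """Return the k entries most similar to the query, best match first."""
    return sorted(index, key=lambda e: cosine(query, e.vector), reverse=True)[:k]


def build_prompt(question, text_query, image_query, standards, exemplars, k=1):
    """Assemble a prompt from retrieved standards (text) and exemplars (visual)."""
    rules = top_k(text_query, standards, k)
    examples = top_k(image_query, exemplars, k)
    lines = ["Relevant standards:"]
    lines += [f"- {e.label}" for e in rules]
    lines.append("Similar reference examples:")
    lines += [f"- {e.label}" for e in examples]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hypothetical toy data; labels are placeholders, not real standards or images.
STANDARDS = [
    Entry("standard_manhole: placeholder rule text about manhole covers", [1.0, 0.0, 0.0]),
    Entry("standard_guardrail: placeholder rule text about guardrails", [0.0, 1.0, 0.0]),
]
EXEMPLARS = [
    Entry("exemplar_01: manhole cover with an annotated state", [0.9, 0.1, 0.0]),
    Entry("exemplar_02: guardrail with an annotated state", [0.1, 0.9, 0.0]),
]

prompt = build_prompt(
    question="What is the state of the detected manhole cover?",
    text_query=[0.8, 0.2, 0.0],   # stand-in for an embedded text query
    image_query=[0.7, 0.3, 0.1],  # stand-in for an embedded image crop
    standards=STANDARDS,
    exemplars=EXEMPLARS,
)
print(prompt)
```

Keeping the two stores separate lets the textual rule and the visual reference be retrieved independently, so a weak match in one modality does not crowd out a strong match in the other.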