Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure

Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often fail to capture the necessary fine-grained attributes and domain rules. While Large Vision-Language Models (VLMs) excel at open-world recognition, they struggle to interpret complex facility states in compliance with engineering standards, which makes them unreliable in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we apply open-vocabulary fine-tuning to Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation of Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
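The dual-modality retrieval step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Entry` type, the `top_k` and `build_prompt` helpers, the toy three-dimensional embeddings, and the placeholder clause and exemplar texts are all hypothetical. The sketch also assumes that the detected crop, the standard clauses, and the visual exemplars are embedded in one shared vector space, which the abstract does not specify. It shows only the general idea: retrieve the nearest text rule and the nearest labeled exemplar for a detected asset, then place both in the prompt passed to the attribute-reasoning model.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    """One retrievable item; all fields are hypothetical toy data."""
    key: str
    embedding: list[float]
    payload: str


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: list[float], store: list[Entry], k: int) -> list[Entry]:
    """Return the k entries most similar to the query embedding."""
    ranked = sorted(store, key=lambda e: cosine(query, e.embedding), reverse=True)
    return ranked[:k]


def build_prompt(crop_embedding: list[float], question: str,
                 standards: list[Entry], exemplars: list[Entry], k: int = 1) -> str:
    """Assemble a prompt from retrieved text rules and visual exemplars."""
    rules = top_k(crop_embedding, standards, k)      # text modality
    examples = top_k(crop_embedding, exemplars, k)   # visual modality
    lines = ["Relevant standard clauses:"]
    lines += [f"- {r.payload}" for r in rules]
    lines.append("Similar labeled exemplars:")
    lines += [f"- {e.payload}" for e in examples]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Toy stores; a real system would hold embeddings produced by an encoder.
standards = [
    Entry("std-cover", [1.0, 0.0, 0.1], "Placeholder clause about manhole cover condition."),
    Entry("std-sign", [0.0, 1.0, 0.1], "Placeholder clause about traffic sign legibility."),
]
exemplars = [
    Entry("ex-cover", [0.9, 0.1, 0.0], "Exemplar: manhole cover, state = damaged."),
    Entry("ex-sign", [0.1, 0.9, 0.0], "Exemplar: traffic sign, state = intact."),
]

crop = [0.95, 0.05, 0.0]  # stand-in embedding of a detected asset crop
prompt = build_prompt(crop, "What is the state of this facility?", standards, exemplars)
print(prompt)
```

In a full pipeline the crop would come from the detector's bounding box and the assembled prompt would be sent, together with the image, to the fine-tuned vision-language model. Both steps are omitted here.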

