AI-driven classification and segmentation are central to automating disaster assessment. However, contemporary vision-language models (VLMs) produce descriptions that are poorly aligned with disaster-assessment objectives, largely because they lack domain knowledge and a sufficiently refined descriptive process. This work presents the Vision Language Caption Enhancer (VLCE), a dedicated multimodal framework that integrates external semantic knowledge from ConceptNet and WordNet into the captioning process, generating disaster-specific descriptions that convert raw visual data into actionable intelligence. VLCE comprises two architectures: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSAT for satellite imagery (xBD dataset), and a Vision Transformer for UAV imagery (RescueNet dataset). Across both architectures and datasets, VLCE consistently outperforms baseline models such as LLaVA and QwenVL. Our best configuration reaches 95.33\% on InfoMetIC for UAV imagery while also performing strongly on satellite imagery. The proposed framework marks a shift from basic visual classification to the generation of comprehensive situational intelligence, with immediate applicability to real-time disaster assessment systems.