Timely interpretation of satellite imagery is critical for disaster response, yet existing vision-language benchmarks for remote sensing largely focus on coarse labels and image-level recognition, overlooking the functional understanding and instruction robustness required in real humanitarian workflows. We introduce DisasterInsight, a multimodal benchmark designed to evaluate vision-language models (VLMs) on realistic disaster analysis tasks. DisasterInsight restructures the xBD dataset into approximately 112K building-centered instances and supports instruction-diverse evaluation across multiple tasks, including building-function classification, damage-level and disaster-type classification, counting, and structured report generation aligned with humanitarian assessment guidelines. To establish domain-adapted baselines, we propose DI-Chat, obtained by fine-tuning existing VLM backbones on disaster-specific instruction data using parameter-efficient Low-Rank Adaptation (LoRA). Extensive experiments on state-of-the-art generic and remote-sensing VLMs reveal substantial performance gaps across tasks, particularly in damage understanding and structured report generation. DI-Chat achieves significant improvements on damage-level and disaster-type classification as well as report generation quality, while building-function classification remains challenging for all evaluated models. DisasterInsight provides a unified benchmark for studying grounded multimodal reasoning in disaster imagery.
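The abstract describes DI-Chat as VLM backbones fine-tuned with parameter-efficient Low-Rank Adaptation (LoRA). As a minimal sketch of the LoRA mechanism itself — with hypothetical dimensions and rank, not DI-Chat's actual configuration — the frozen pretrained weight is augmented by a trainable low-rank update:

```python
import numpy as np

class LoRALinear:
    """Illustrative LoRA layer: y = W x + (alpha/r) * B A x.

    W is the frozen pretrained weight; only the low-rank factors
    A and B are trained. Shapes and rank here are assumptions for
    illustration, not the paper's settings.
    """

    def __init__(self, d_out, d_in, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))        # frozen
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable, zero-init
        self.scale = alpha / rank

    def forward(self, x):
        # B is zero-initialized, so the layer starts identical to the
        # pretrained W; fine-tuning only perturbs it through B A.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(d_out=4096, d_in=4096, rank=8)
print(layer.W.size)             # 16777216 parameters in the frozen weight
print(layer.trainable_params()) # 65536 trainable parameters (~0.4%)
```

The zero-initialized `B` guarantees the adapted model matches the pretrained backbone before any disaster-specific instruction tuning begins, which is what makes LoRA a low-risk, parameter-efficient way to obtain domain-adapted baselines.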