Social media imagery provides a low-latency source of situational information during natural and human-induced disasters, enabling rapid damage assessment and response. While Visual Question Answering (VQA) has shown strong performance in general-purpose domains, its suitability for the complex, safety-critical reasoning required in disaster response remains unclear. We introduce DisasterVQA, a benchmark dataset designed for perception and reasoning in crisis contexts. DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning diverse events such as floods, wildfires, and earthquakes. Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA, the dataset includes binary, multiple-choice, and open-ended questions covering situational awareness and operational decision-making tasks. We benchmark seven state-of-the-art vision-language models and find substantial variability in performance across question types, disaster categories, regions, and humanitarian tasks. Although models achieve high accuracy on binary questions, they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, particularly for underrepresented disaster scenarios. DisasterVQA provides a challenging and practical benchmark to guide the development of more robust and operationally meaningful vision-language models for disaster response. The dataset is publicly available at https://zenodo.org/records/18267770.