Earth vision research typically focuses on extracting the locations and categories of geospatial objects but neglects the relations between objects and comprehensive reasoning. Motivated by city planning needs, we develop a multi-modal, multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis. The EarthVQA dataset contains 6,000 images, the corresponding semantic masks, and 208,593 QA pairs that embed urban and rural governance requirements. As objects are the basis for complex relational reasoning, we propose a Semantic OBject Awareness framework (SOBA) to advance VQA in an object-centric way. To preserve refined spatial locations and semantics, SOBA uses a segmentation network to generate object semantics. Object-guided attention aggregates features from the interior of each object via pseudo masks, and bidirectional cross-attention then models the relations between objects hierarchically. To optimize object counting, we propose a numerical difference loss that dynamically adds difference penalties, unifying the classification and regression tasks. Experimental results show that SOBA outperforms advanced methods from both the general and remote sensing domains. We believe this dataset and framework provide a strong benchmark for complex analysis in Earth vision. The project page is at https://Junjue-Wang.github.io/homepage/EarthVQA.
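The idea of a numerical difference loss, a classification loss over count classes that is penalized more heavily the further the predicted count lies from the true count, can be illustrated with a minimal sketch. The abstract does not give the formula, so the form below is an assumption rather than the authors' definition: it scales the cross-entropy by `1 + alpha * |predicted - target| ** gamma`, where the function name, `alpha`, and `gamma` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def numerical_difference_loss(logits, target, alpha=1.0, gamma=1.0):
    """Cross-entropy over count classes, scaled by a count-distance penalty.

    Assumed form (not taken from the paper):
        loss = (1 + alpha * |predicted - target| ** gamma) * CE
    `logits[i]` is the score for the answer "count == i"; `target` is the
    true count. `alpha` and `gamma` are hypothetical hyperparameters.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    # The predicted count is the class with the highest score.
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    penalty = alpha * abs(predicted - target) ** gamma
    return (1.0 + penalty) * cross_entropy


def peaked_logits(num_classes, peak_index, height=5.0):
    """Helper for the demo: all-zero logits with a single peak."""
    return [height if i == peak_index else 0.0 for i in range(num_classes)]


if __name__ == "__main__":
    true_count = 2
    near_miss = numerical_difference_loss(peaked_logits(8, 3), true_count)
    far_miss = numerical_difference_loss(peaked_logits(8, 6), true_count)
    print(f"near miss: {near_miss:.3f}, far miss: {far_miss:.3f}")
```

In this sketch the two wrong predictions assign the same probability to the true count, so plain cross-entropy cannot tell them apart; the distance penalty is what makes predicting 6 cost more than predicting 3 when the answer is 2. That is the sense in which a loss of this kind combines classification with a regression-style signal.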