EarthVL: A Progressive Earth Vision-Language Understanding and Generation Framework

Earth vision has reached milestones in geospatial object recognition, but object-relational reasoning remains underexplored, which limits comprehensive scene understanding. To address this, we propose a progressive Earth vision-language understanding and generation framework comprising a multi-task dataset (EarthVLSet) and a semantic-guided network (EarthVLNet). Focusing on city planning applications, EarthVLSet includes 10.9k sub-meter resolution remote sensing images, land-cover masks, and 761.5k textual pairs covering both multiple-choice and open-ended visual question answering (VQA) tasks. EarthVLNet takes an object-centric approach and progressively performs semantic segmentation, relational reasoning, and comprehensive understanding. The first stage performs land-cover segmentation to generate object semantics that guide VQA. Guided by these pixel-wise semantics, an object-aware large language model (LLM) performs relational reasoning and knowledge summarization to generate the required answers. For optimization, a numerical difference loss is proposed that dynamically adds difference penalties to handle the varied statistics of different objects. Benchmarks on three tasks (semantic segmentation, multiple-choice VQA, and open-ended VQA) demonstrate the advantages of EarthVLNet, and the results suggest three directions for future work: 1) segmentation features consistently enhance VQA performance, even in cross-dataset scenarios; 2) multiple-choice tasks are more sensitive to the vision encoder than to the language decoder; and 3) open-ended tasks require both advanced vision encoders and advanced language decoders for optimal performance. We believe this dataset and method will provide a useful benchmark that connects "image-mask-text", advancing geographical applications of Earth vision.
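
The abstract does not give the formula for the numerical difference loss, so the following is only a minimal sketch of one plausible reading: a cross-entropy term that is scaled up when a numeric (counting-style) answer is missed by a larger margin. The function names, the `answer_values` mapping, the `alpha` weight, and the range normalization are all assumptions made for illustration, not the authors' formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_idx):
    """Standard cross-entropy for a single example."""
    return -math.log(softmax(logits)[target_idx])


def numerical_difference_loss(logits, target_idx, answer_values, alpha=1.0):
    """Hypothetical sketch: cross-entropy scaled by a numeric-difference penalty.

    answer_values[i] is the numeric value of candidate answer i, or None
    for non-numeric answers such as "yes" (an assumed encoding).
    """
    ce = cross_entropy(logits, target_idx)
    target_value = answer_values[target_idx]
    if target_value is None:
        # Non-numeric question: no penalty, plain cross-entropy.
        return ce

    numeric = [v for v in answer_values if v is not None]
    value_range = (max(numeric) - min(numeric)) or 1.0

    # The penalty depends on the current prediction, so it changes per sample.
    pred_idx = max(range(len(logits)), key=logits.__getitem__)
    pred_value = answer_values[pred_idx]
    if pred_value is None:
        diff = 1.0  # assumed maximum penalty for a non-numeric prediction
    else:
        diff = abs(pred_value - target_value) / value_range
    return ce * (1.0 + alpha * diff)


# Candidate answers "0", "1", "2", "5", "yes"; the ground truth is "1".
values = [0, 1, 2, 5, None]
near_miss = numerical_difference_loss([0.0, 1.0, 3.0, 0.0, 0.0], 1, values)
far_miss = numerical_difference_loss([0.0, 1.0, 0.0, 3.0, 0.0], 1, values)
print(far_miss > near_miss)
```

In this sketch both predictions assign the same probability to the correct answer, so their cross-entropy is identical; only the penalty term separates predicting "2" from predicting "5".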
