Cross-modal artificial intelligence, represented by vision-language models, has achieved remarkable success in general image understanding. However, a fundamental cognitive inconsistency exists between general visual representation and remote sensing image interpretation: remote sensing images couple topography, landforms, and spatial structure, and therefore inherently require models to possess deep geoscientific understanding. This cognitive gap is further amplified in synthetic aperture radar (SAR) imagery: although SAR offers irreplaceable all-weather, day-and-night observation capabilities, it is constrained by its coherent imaging mechanism and exhibits significant modal heterogeneity relative to general images. To address this inconsistency, we propose FUSAR-KLIP, the first knowledge-guided general multimodal foundation model for SAR, together with reusable data and evaluation baselines. Specifically: (1) we constructed FUSAR-GEOVL-1M, the first large-scale SAR dataset with complete geographic projection attributes, covering multiple satellite platforms, 120,000 images, and 135 cities; (2) we generated aligned structured text through hierarchical cognitive chains of thought, accurately encoding more than one million multidimensional semantic entries ranging from geomorphological environment and regional attributes to spatial relationships; (3) we designed a self-consistent iterative optimization mechanism that guides cross-modal learning with this knowledge, consistent with human cognition and physical laws, in a self-supervised closed loop of contrastive, matching, and reconstruction objectives; (4) we established a unified evaluation benchmark across 11 typical downstream tasks in the two major categories of vision and language, and compared FUSAR-KLIP with 15 mainstream foundation models.
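To make the closed loop in (3) concrete, the sketch below illustrates how contrastive, matching, and reconstruction objectives can be combined into a single training loss. This is a minimal illustration under assumed inputs; all tensor names and the equal loss weighting are hypothetical placeholders, not the authors' implementation or iteration schedule.

```python
# Minimal sketch of a contrastive + matching + reconstruction closed loop.
# Inputs are assumed to come from a SAR image encoder and a text encoder;
# none of these names correspond to the paper's actual code.
import torch
import torch.nn.functional as F


def closed_loop_loss(img_emb, txt_emb, match_logits, match_labels,
                     recon_pred, recon_target, temperature=0.07):
    """img_emb, txt_emb : (B, D) L2-normalized embeddings of paired samples.
    match_logits      : (B, 2) image-text matching classifier outputs.
    match_labels      : (B,)   1 for matched pairs, 0 for mismatched pairs.
    recon_pred/target : (B, N, C) predicted vs. original masked patch features.
    """
    # 1) Bidirectional image-text contrastive term (InfoNCE-style).
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_con = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # 2) Image-text matching term: binary classification over candidate pairs.
    loss_match = F.cross_entropy(match_logits, match_labels)

    # 3) Reconstruction term: regress original features of masked regions.
    loss_recon = F.mse_loss(recon_pred, recon_target)

    # Equal weighting here for illustration; the self-consistent iterative
    # optimization described in the abstract would adjust how these terms
    # interact across training rounds.
    return loss_con + loss_match + loss_recon
```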