Prior studies on Remote Sensing Foundation Models (RSFMs) reveal immense potential for a generic model for Earth Observation. Nevertheless, these works primarily focus on a single modality and omit temporal and geo-context modeling, which hampers their capabilities on diverse tasks. In this study, we present SkySense, a generic billion-scale model pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences. SkySense incorporates a factorized multi-modal spatiotemporal encoder that takes temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input. This encoder is pre-trained with our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities. To further enhance the RSI representations with geo-context cues, we introduce Geo-Context Prototype Learning, which learns region-aware prototypes on top of RSI's multi-modal spatiotemporal features. To the best of our knowledge, SkySense is the largest multi-modal RSFM to date, and its modules can be flexibly combined or used individually to accommodate various tasks. It demonstrates remarkable generalization capabilities in a thorough evaluation encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to temporal, and classification to localization. SkySense surpasses 18 recent RSFMs in all test scenarios. Specifically, it outperforms the latest models such as GFM, SatLas and Scale-MAE by a large margin, i.e., by 2.76%, 3.67% and 3.61% on average, respectively. We will release the pre-trained weights to facilitate future research and Earth Observation applications.