A$^{2}$-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder

Vast amounts of remote sensing (RS) data provide Earth observations across multiple dimensions, encompassing spatial, temporal, and spectral information that is essential for addressing global-scale challenges such as land use monitoring, disaster prevention, and environmental change mitigation. Although various pre-training methods have been tailored to the characteristics of RS data, a key limitation persists: they cannot effectively integrate spatial, temporal, and spectral information within a single unified model. To unlock the potential of RS data, we construct a Spatial-Temporal-Spectral Structured Dataset (STSSD) characterized by the incorporation of multiple RS sources, diverse coverage, unified locations within image sets, and heterogeneity within images. Building upon this structured dataset, we propose an Anchor-Aware Masked AutoEncoder method (A$^{2}$-MAE), which leverages the intrinsic complementary information from different kinds of images and geo-information to reconstruct the masked patches during the pre-training phase. A$^{2}$-MAE integrates an anchor-aware masking strategy and a geographic encoding module to comprehensively exploit the properties of RS images. Specifically, the proposed anchor-aware masking strategy dynamically adapts the masking process based on the meta-information of a pre-selected anchor image, thereby facilitating training on images captured by diverse types of RS sources within one model. Furthermore, we propose a geographic encoding method to leverage accurate spatial patterns, enhancing the model's generalization capabilities for downstream applications, which are generally location-related. Extensive experiments demonstrate that our method achieves comprehensive improvements over existing RS pre-training methods across various downstream tasks, including image classification, semantic segmentation, and change detection.
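To make the idea of a geographic encoding concrete, the following is a minimal sketch of one common way to turn a latitude/longitude pair into a fixed-length feature vector. It is an illustrative assumption, not the encoding used in A$^{2}$-MAE: the function name `geo_encoding`, the `dim` parameter, and the choice of a sinusoidal scheme with power-of-two frequencies are all hypothetical.

```python
import numpy as np


def geo_encoding(lat_deg, lon_deg, dim=16):
    """Hypothetical sinusoidal encoding of a (latitude, longitude) pair.

    This is an assumed, generic scheme for illustration only; the
    abstract does not specify how A^2-MAE encodes geo-information.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    n_freq = dim // 4
    # Integer frequencies 1, 2, 4, ... keep the encoding periodic in
    # longitude, so +180 and -180 degrees map to the same vector.
    freqs = 2.0 ** np.arange(n_freq)
    lat = np.deg2rad(lat_deg)
    lon = np.deg2rad(lon_deg)
    return np.concatenate([
        np.sin(freqs * lat), np.cos(freqs * lat),
        np.sin(freqs * lon), np.cos(freqs * lon),
    ])
```

Under these assumptions, such a vector could be projected and added to the patch embeddings of an image, giving the model a location signal alongside the pixel content.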
