Recent advances in unsupervised learning have demonstrated that large vision models can achieve promising results on downstream tasks by pre-training on large amounts of unlabelled data. Such pre-training techniques have also been explored recently in the remote sensing domain, where large amounts of unlabelled data are available. Unlike standard natural image datasets, remote sensing data is acquired with various sensor technologies and exhibits a diverse range of scale variations as well as modalities. Existing satellite image pre-training methods either ignore the scale information present in remote sensing imagery or restrict themselves to a single data modality. In this paper, we revisit transformer pre-training and leverage multi-scale information in a way that works effectively across multiple modalities. Our proposed approach, named SatMAE++, performs multi-scale pre-training and uses convolution-based upsampling blocks to reconstruct the image at higher scales, which makes it straightforward to extend to additional scales. Compared to existing works, the proposed SatMAE++ with multi-scale pre-training is equally effective for both optical and multi-spectral imagery. Extensive experiments on six datasets reveal the merits of the proposed contributions, leading to state-of-the-art performance on all datasets. SatMAE++ achieves a mean average precision (mAP) gain of 2.5\% on the multi-label classification task on the BigEarthNet dataset. Our code and pre-trained models are available at \url{https://github.com/techmn/satmae_pp}.
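The multi-scale reconstruction idea can be illustrated with a small NumPy sketch. It is not the authors' implementation: it omits the transformer encoder and masking entirely, stands in nearest-neighbour upsampling followed by a 3x3 convolution for the convolution-based upsampling blocks, and assumes a mean squared error at every scale. The function names (`downsample2x`, `conv3x3`, `upsample_block`, `multiscale_loss`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def downsample2x(img):
    """Average-pool a (C, H, W) array by a factor of 2 (H and W must be even)."""
    c, h, w = img.shape
    return img.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def conv3x3(x, weight):
    """Zero-padded 3x3 convolution.

    x has shape (C_in, H, W); weight has shape (C_out, C_in, 3, 3).
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], patch)
    return out


def upsample_block(x, weight):
    """Assumed stand-in block: 2x nearest-neighbour upsampling, then a 3x3 conv."""
    up = x.repeat(2, axis=1).repeat(2, axis=2)
    return conv3x3(up, weight)


def multiscale_loss(base_recon, target, weights):
    """Sum per-scale MSE between reconstructions and a target pyramid.

    base_recon: (C, H, W) reconstruction at the lowest scale.
    target:     (C, H * 2**k, W * 2**k) image at the highest scale,
                where k = len(weights) is the number of upsampling blocks.
    """
    # Build the target pyramid, then order it from lowest to highest scale.
    targets = [target]
    for _ in weights:
        targets.append(downsample2x(targets[-1]))
    targets.reverse()

    recon = base_recon
    loss = np.mean((recon - targets[0]) ** 2)
    for w, t in zip(weights, targets[1:]):
        recon = upsample_block(recon, w)  # reconstruct the next higher scale
        loss += np.mean((recon - t) ** 2)
    return loss
```

In this sketch, supporting one more scale only requires appending one more weight tensor to `weights` and supplying a target that is twice as large, which mirrors the extensibility claim in the abstract.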