Remote sensing images present unique challenges to image analysis due to their extensive geographic coverage, hardware limitations, and misaligned multi-scale images. This paper revisits the classical multi-scale representation learning problem, but within the general framework of self-supervised learning for remote sensing image understanding. We present Cross-Scale MAE, a self-supervised model built upon the Masked Auto-Encoder (MAE). During pre-training, Cross-Scale MAE employs scale augmentation techniques and enforces cross-scale consistency constraints through both contrastive and generative losses, yielding consistent and meaningful representations well suited to a wide range of downstream tasks. Furthermore, our implementation leverages the xFormers library to accelerate network pre-training on a single GPU while maintaining the quality of the learned representations. Experimental evaluations demonstrate that Cross-Scale MAE outperforms both the standard MAE and other state-of-the-art remote sensing MAE methods.
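The abstract names a contrastive and a generative cross-scale consistency term but does not give their exact form. The following is a minimal, dependency-free sketch under stated assumptions, not the paper's implementation: scale augmentation is assumed to be simple average pooling, the contrastive term is assumed to be an InfoNCE-style loss where the positive for each scene is the same scene at the other scale, and the generative term is assumed to be a mean squared error between features predicted from one scale and the features of the other. The names `downsample`, `info_nce`, and `cross_scale_loss`, as well as the `temperature` and `weight` values, are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def downsample(img: List[List[float]], factor: int = 2) -> List[List[float]]:
    """Hypothetical scale augmentation: average-pool an H x W image by `factor`."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [
        [
            sum(img[i * factor + di][j * factor + dj]
                for di in range(factor) for dj in range(factor)) / factor ** 2
            for j in range(w)
        ]
        for i in range(h)
    ]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(z_a: List[Vector], z_b: List[Vector], temperature: float = 0.1) -> float:
    """Assumed contrastive term: row i of z_a should match row i of z_b
    (same scene at another scale); all other rows act as negatives."""
    a = [l2_normalize(v) for v in z_a]
    b = [l2_normalize(v) for v in z_b]
    loss = 0.0
    for i, ai in enumerate(a):
        # Cosine similarities to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(lg - m) for lg in logits))
        loss += log_sum_exp - logits[i]  # cross-entropy with target index i
    return loss / len(a)


def mse(u: Vector, v: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)


def cross_scale_loss(z_a: List[Vector], z_b: List[Vector],
                     pred_b_from_a: List[Vector], weight: float = 1.0) -> float:
    """Assumed total: contrastive term plus a hypothetical generative term
    that predicts scale-b features from scale-a features."""
    gen = sum(mse(p, t) for p, t in zip(pred_b_from_a, z_b)) / len(z_b)
    return info_nce(z_a, z_b) + weight * gen


if __name__ == "__main__":
    img = [[1, 1, 3, 3], [1, 1, 3, 3], [5, 5, 7, 7], [5, 5, 7, 7]]
    print(downsample(img))  # [[1.0, 3.0], [5.0, 7.0]]
    aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)  # True: matched scales give a lower loss
```

In practice the embeddings would come from the MAE encoder applied to each scale, and a real implementation would use a tensor library rather than Python lists; the list-based version only makes the structure of the two terms explicit.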