Self-supervised learning (SSL) models have recently demonstrated remarkable performance across various tasks, including image segmentation. This study examines the emergent characteristics of the Self-Distillation with No Labels (DINO) algorithm and its application to Synthetic Aperture Radar (SAR) imagery. We pre-train a vision transformer (ViT)-based DINO model on unlabeled SAR data and then fine-tune it to predict high-resolution land cover maps. We rigorously evaluate the utility of the attention maps generated by the ViT backbone and compare them with the model's token embedding space. We observe a small improvement in model performance with pre-training compared to training from scratch, and we discuss the limitations and opportunities of SSL for remote sensing and land cover segmentation. Beyond these small performance gains, we show that ViT attention maps hold great intrinsic value for remote sensing and could provide useful inputs to other algorithms. With this, our work lays the groundwork for bigger and better SSL models for Earth Observation.