Remote sensing scene classification has been extensively studied for its critical role in geological survey, oil exploration, traffic management, earthquake prediction, wildfire monitoring, and intelligence monitoring. In the past, Machine Learning (ML) methods for this task mainly used backbones pretrained with supervised learning (SL). Because Masked Image Modeling (MIM), a self-supervised learning (SSL) technique, has been shown to be a better way to learn visual feature representations, it presents a new opportunity to improve ML performance on the scene classification task. This research explores the potential of MIM-pretrained backbones on four well-known classification datasets: Merced, AID, NWPU-RESISC45, and Optimal-31. Compared to the published benchmarks, we show that MIM-pretrained Vision Transformer (ViT) backbones outperform other alternatives (by up to 18% in top-1 accuracy) and that the MIM technique learns better feature representations than its supervised learning counterparts (by up to 5% in top-1 accuracy). Moreover, we show that general-purpose MIM-pretrained ViTs achieve performance competitive with the specially designed yet complicated Transformer for Remote Sensing (TRS) framework. Our experimental results also provide a performance baseline for future studies.