Object detection in remote sensing imagery plays a vital role in many Earth observation applications. Unlike object detection in natural scene images, however, the task is particularly challenging due to the abundance of small, often barely visible objects scattered across diverse terrains. To address these challenges, multimodal learning can be used to integrate features from different data modalities, thereby improving detection accuracy. Nonetheless, the performance of multimodal learning is often constrained by the limited size of labeled datasets. In this paper, we propose to use Masked Image Modeling (MIM) as a pre-training technique, leveraging self-supervised learning on unlabeled data to enhance detection performance. However, conventional MIM methods such as MAE, which use masked tokens without any contextual information, struggle to capture fine-grained details because the masked tokens do not interact with other parts of the image. To address this, we propose a new interactive MIM method that establishes interactions between different tokens, which is particularly beneficial for object detection in remote sensing. Extensive ablation studies and evaluation demonstrate the effectiveness of our approach.
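For readers unfamiliar with MIM pre-training, the MAE-style masking the abstract refers to can be sketched as random patch-token dropout before the encoder. This is a minimal illustrative sketch, not the paper's method; the function name and array layout are assumptions made here for clarity.

```python
import numpy as np

def random_mask_patches(patches, mask_ratio=0.75, seed=0):
    """MAE-style random masking (illustrative sketch, not the paper's method).

    `patches` has shape (num_patches, dim). Returns the kept (visible)
    patch tokens and a boolean mask where True marks a masked-out patch;
    in MAE, only the visible tokens are passed to the encoder and the
    decoder reconstructs the masked ones.
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))      # e.g. keep 25% of patches
    perm = rng.permutation(n)               # random shuffle of patch indices
    keep_idx = np.sort(perm[:n_keep])       # indices of visible patches
    mask = np.ones(n, dtype=bool)           # True = masked
    mask[keep_idx] = False
    return patches[keep_idx], mask
```

Because the masked positions are filled only at the decoder with learned, context-free mask tokens, the encoder never lets masked regions interact with visible ones; the interactive MIM proposed in the paper is aimed at exactly this limitation.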