PAD: Self-Supervised Pre-Training with Patchwise-Scale Adapter for Infrared Images

Self-supervised learning (SSL) for RGB images has achieved significant success, yet research on SSL for infrared images remains limited, primarily because of three challenges: 1) the lack of a suitable large-scale infrared pre-training dataset, 2) the distinctiveness of non-iconic infrared images, which renders common pre-training tasks such as masked image modeling (MIM) less effective, and 3) the scarcity of fine-grained textures, which makes it particularly difficult to learn general image features. To address these issues, we construct a Multi-Scene Infrared Pre-training (MSIP) dataset comprising 178,756 images, and introduce object-sensitive random RoI cropping, an image preprocessing method, to handle non-iconic images. To alleviate the impact of weak textures on feature learning, we propose a pre-training paradigm called Pre-training with ADapter (PAD), which uses adapters to learn domain-specific features while freezing the parameters pre-trained on ImageNet to retain their general feature extraction capability. This paradigm is applicable to any transformer-based SSL method. Furthermore, to coordinate pre-trained and newly learned features more flexibly across layers and patches, we introduce a patchwise-scale adapter with dynamically learnable scale factors. Extensive experiments on three downstream tasks show that PAD, with only 1.23M pre-trainable parameters, outperforms other baseline paradigms, including continual full pre-training on MSIP. Our code and dataset are available at https://github.com/casiatao/PAD.

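To make the adapter idea in the abstract concrete, the following is a minimal NumPy sketch of a bottleneck adapter whose output is scaled separately for each patch before being added to the output of a frozen layer. It is an illustration under stated assumptions, not the implementation in the PAD repository: the class name `PatchwiseScaleAdapter`, the ReLU bottleneck, the zero-initialized up-projection, and the choice to predict the scale with a single linear map from the patch features are all assumptions made here, since the abstract does not specify these details.

```python
import numpy as np


class PatchwiseScaleAdapter:
    """Illustrative bottleneck adapter with one scale factor per patch.

    Hypothetical sketch: layer shapes, initialization, and the way the
    scale is computed are assumptions, not taken from the paper.
    """

    def __init__(self, dim, bottleneck, rng):
        # Down- and up-projection of a standard bottleneck adapter.
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Assumption: zero init, so the adapter initially changes nothing.
        self.w_up = np.zeros((bottleneck, dim))
        # Assumption: a linear map yields one scale factor per patch.
        self.w_scale = rng.normal(0.0, 0.02, size=(dim, 1))

    def __call__(self, x, frozen_out):
        # x, frozen_out: arrays of shape (num_patches, dim).
        hidden = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck
        delta = hidden @ self.w_up                 # newly learned features
        scale = x @ self.w_scale                   # shape (num_patches, 1)
        # Frozen features plus patchwise-scaled adapter features.
        return frozen_out + scale * delta

    def num_params(self):
        # Only these weights would be updated during pre-training.
        return self.w_down.size + self.w_up.size + self.w_scale.size


rng = np.random.default_rng(0)
dim, bottleneck, num_patches = 8, 2, 4
adapter = PatchwiseScaleAdapter(dim, bottleneck, rng)

x = rng.normal(size=(num_patches, dim))
# Stand-in for a frozen, ImageNet-pre-trained layer (never updated).
w_frozen = rng.normal(size=(dim, dim))
frozen_out = x @ w_frozen

out = adapter(x, frozen_out)
```

In this sketch only the adapter weights (`w_down`, `w_up`, `w_scale`) would receive gradient updates, while `w_frozen` stays fixed, which mirrors the stated goal of retaining general features while learning domain-specific ones. Because the scale is computed from each patch's own features, different patches can mix pre-trained and newly learned features in different proportions.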