Masked Image Modeling (MIM) is a self-supervised learning technique that learns rich visual representations from unlabeled images by predicting the pixels missing from randomly masked regions. It has proven to be a powerful pre-training tool for Vision Transformers (ViTs), yielding impressive results across various tasks. Nevertheless, most MIM methods rely heavily on a random masking strategy to formulate the pretext task. This strategy requires numerous trials to determine the optimal masking ratio, which is resource-intensive: each trial pre-trains the model for between 800 and 1600 epochs. Furthermore, a single ratio may not suit all datasets. In this work, we propose a new masking strategy that effectively helps the model capture both global and local features. Based on this masking strategy, we introduce SymMIM, our proposed training pipeline for MIM. SymMIM achieves a new SOTA accuracy of 85.9\% on ImageNet using ViT-Large and surpasses previous SOTA methods across downstream tasks such as image classification, semantic segmentation, object detection, and instance segmentation.
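For intuition, the random masking strategy the abstract critiques can be sketched as follows. This is a minimal illustration in the common MAE-style formulation, not SymMIM's proposed strategy; the function name, patch count, and mask ratio are illustrative assumptions.

```python
import numpy as np

def random_mask(num_patches: int, mask_ratio: float, seed: int = 0) -> np.ndarray:
    """Return a boolean mask over image patches (True = patch is dropped).

    The model only sees the unmasked patches and must reconstruct the rest.
    mask_ratio is the hyperparameter that typically requires costly tuning.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(num_patches * mask_ratio)
    mask = np.zeros(num_patches, dtype=bool)
    # Choose which patches to drop uniformly at random, without replacement.
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask

# A 224x224 image with 16x16 patches gives 14*14 = 196 patches;
# a 75% ratio masks int(196 * 0.75) = 147 of them.
mask = random_mask(num_patches=196, mask_ratio=0.75)
print(mask.sum())  # 147
```

Because the choice of `mask_ratio` is made per dataset and validated only after a full pre-training run, sweeping over candidate ratios is what drives the 800 to 1600 epoch cost noted above.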