MixMask: Revisiting Masking Strategy for Siamese ConvNets

Recent progress in self-supervised learning has successfully combined Masked Image Modeling (MIM) with Siamese networks, harnessing the strengths of both approaches. Nonetheless, challenges persist when conventional erase-based masking is integrated into Siamese ConvNets. Two primary concerns are: (1) ConvNets process data continuously and cannot exclude non-informative masked regions, which reduces training efficiency compared to ViT architectures; (2) erase-based masking is misaligned with the contrastive objective, which differs from the objective used in MIM. To address these challenges, this work introduces a novel filling-based masking approach, termed MixMask. The proposed method replaces erased areas with content from a different image, countering the information loss seen in traditional masking methods. Additionally, we introduce an adaptive loss function that captures the semantics of the newly patched views, so that the mixed views integrate cleanly into the Siamese framework. We empirically validate the effectiveness of our approach through comprehensive experiments across various datasets and application scenarios. The results show improved performance in linear probing, semi-supervised and supervised finetuning, object detection, and segmentation. Notably, our method outperforms MSCN, establishing MixMask as a more advantageous masking solution for Siamese ConvNets. Our code and models are publicly available at https://github.com/kirill-vish/MixMask.
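The filling-based idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`make_block_mask`, `mix_mask`, `mixed_loss`), the block-wise mask layout, and the area-weighted form of the loss are all assumptions made for illustration. The abstract states only that erased regions are filled with content from a different image and that the loss adapts to the mixed views.

```python
import numpy as np


def make_block_mask(height, width, grid, mask_ratio, rng):
    """Return a (height, width) float mask: 1 keeps image A, 0 takes image B.

    Assumption: the mask is drawn on a coarse grid x grid layout and
    upsampled to pixel resolution; the paper's mask layout may differ.
    """
    assert height % grid == 0 and width % grid == 0, "size must divide evenly"
    # Each grid cell is masked (filled from B) with probability mask_ratio.
    cells = (rng.random((grid, grid)) >= mask_ratio).astype(np.float32)
    block = np.ones((height // grid, width // grid), dtype=np.float32)
    # The Kronecker product expands every cell into a block of pixels.
    return np.kron(cells, block)


def mix_mask(img_a, img_b, mask):
    """Fill the masked-out regions of img_a with pixels from img_b.

    Returns the mixed image and lam, the fraction of pixels kept from img_a.
    """
    m = mask[..., None]  # broadcast the mask over the channel axis
    mixed = m * img_a + (1.0 - m) * img_b
    lam = float(mask.mean())
    return mixed, lam


def mixed_loss(loss_a, loss_b, lam):
    """Hypothetical area-weighted loss for a mixed view.

    Assumption: each source image contributes in proportion to its visible
    area. This only illustrates the idea of a loss that adapts to the mix.
    """
    return lam * loss_a + (1.0 - lam) * loss_b
```

In a Siamese setup, one would presumably compare the mixed view against unmixed views of both source images and weight the two similarity terms by `lam` and `1 - lam`. Unlike erase-based masking, every pixel of the mixed input carries image content, which is the property the abstract highlights for ConvNets.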
