Masking Improves Contrastive Self-Supervised Learning for ConvNets, and Saliency Tells You Where

Image data has begun to benefit from a simple but effective self-supervised learning scheme built on masking and a self-reconstruction objective, thanks to the introduction of tokenization and the vision transformer backbone. Convolutional neural networks, another important and widely adopted architecture for image data, have contrastive-learning techniques to drive their self-supervised learning, yet still struggle to benefit significantly from such a straightforward and general masking operation. In this work, we aim to ease the incorporation of masking into the contrastive-learning framework for convolutional neural networks as an additional augmentation method. Prior works have discussed the adverse effects of masking on ConvNets, including the additional but unwanted edges between masked and unmasked regions. Beyond these, we identify a further potential problem: for one view in a contrastive sample pair, the randomly sampled masked regions can be overly concentrated on important/salient objects, resulting in a misleading contrast with the other view. To this end, we propose to explicitly take a saliency constraint into account, so that the masked regions are more evenly distributed between foreground and background when realizing the masking-based augmentation. Moreover, we introduce hard negative samples by masking larger regions of the salient patches in an input image. Extensive experiments on various datasets, contrastive learning mechanisms, and downstream tasks verify the efficacy and superior performance of our proposed method with respect to several state-of-the-art baselines.
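
The two masking ideas described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work. It assumes a patch-level saliency map is already available (how it is obtained is not specified here), splits patches into foreground and background at the median saliency, and uses illustrative ratios. The function names `saliency_balanced_mask` and `hard_negative_mask`, the median split, and the default ratios are all hypothetical choices made for this example.

```python
import numpy as np


def _split_foreground(saliency):
    """Assumption: patches at or above the median saliency count as foreground."""
    flat = saliency.ravel()
    is_fg = flat >= np.median(flat)
    return np.flatnonzero(is_fg), np.flatnonzero(~is_fg)


def saliency_balanced_mask(saliency, mask_ratio=0.3, rng=None):
    """Boolean patch mask (True = masked) that masks the same fraction of
    foreground and background patches, so masking is not concentrated
    on the salient object."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros(saliency.size, dtype=bool)
    for group in _split_foreground(saliency):
        k = int(round(mask_ratio * group.size))
        if k > 0:
            mask[rng.choice(group, size=k, replace=False)] = True
    return mask.reshape(saliency.shape)


def hard_negative_mask(saliency, salient_ratio=0.7, rng=None):
    """Boolean patch mask that hides a larger share of the salient
    (foreground) patches only, to build a hard negative view."""
    rng = np.random.default_rng() if rng is None else rng
    fg_idx, _ = _split_foreground(saliency)
    mask = np.zeros(saliency.size, dtype=bool)
    k = int(round(salient_ratio * fg_idx.size))
    if k > 0:
        mask[rng.choice(fg_idx, size=k, replace=False)] = True
    return mask.reshape(saliency.shape)
```

Each mask lives on the patch grid; to apply it to an image, one would upsample it to pixel resolution and fill the masked patches, with the fill strategy being a further design choice that this sketch leaves open.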
