Multi-Level Contrastive Learning for Dense Prediction Task

In this work, we present Multi-Level Contrastive Learning for Dense Prediction Task (MCL), an efficient self-supervised method for learning region-level feature representations for dense prediction tasks. Our method is motivated by three key factors in detection: localization, scale consistency, and recognition. To explicitly encode absolute position and scale information, we propose a novel pretext task that assembles multi-scale images into a montage to mimic multi-object scenarios. Unlike existing image-level self-supervised methods, our method constructs a multi-level contrastive loss that treats each sub-region of the montage image as a singleton. This enables the network to learn regional semantic representations that are consistent under translation and scale, while reducing the number of pre-training epochs to match that of supervised pre-training. Extensive experiments demonstrate that MCL consistently outperforms recent state-of-the-art methods on various datasets by significant margins. In particular, MCL obtains 42.5 AP$^\mathrm{bb}$ and 38.3 AP$^\mathrm{mk}$ on COCO with 1x schedule fine-tuning, using Mask R-CNN with an R50-FPN backbone pre-trained for 100 epochs. Compared with MoCo, MCL is higher by 4.0 AP$^\mathrm{bb}$ and 3.1 AP$^\mathrm{mk}$. Furthermore, we explore the alignment between the pretext task and downstream tasks. We extend our pretext task to supervised pre-training, which achieves performance similar to that of self-supervised learning. This result demonstrates the importance of aligning the pretext task with downstream tasks and indicates the potential for wider applicability of our method beyond self-supervised settings.
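
To make the montage pretext task more concrete, the following is a minimal sketch, not the authors' implementation. It pastes images at two scales onto one canvas, records one box per sub-region, pools a feature vector per box, and applies an InfoNCE-style loss in which each sub-region is its own class. The layout (one half-size image plus twelve quarter-size images), the canvas size, the average pooling, the temperature of 0.2, and all function names are assumptions chosen for illustration. The sketch computes the loss at a single level only, whereas the method described above uses a multi-level loss.

```python
import numpy as np


def resize_nearest(img, size):
    """Nearest-neighbour resize of an HxWxC array to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]


def make_montage(images, canvas=64):
    """Paste 13 images at two scales onto one canvas (hypothetical layout).

    The top-left quadrant holds one image at canvas/2; each of the other
    three quadrants holds four images at canvas/4. Returns the canvas and
    one (x0, y0, x1, y1) box per pasted image.
    """
    half, quarter = canvas // 2, canvas // 4
    slots = [(0, 0, half)]
    for qx, qy in [(half, 0), (0, half), (half, half)]:
        for dx in (0, quarter):
            for dy in (0, quarter):
                slots.append((qx + dx, qy + dy, quarter))
    if len(images) != len(slots):
        raise ValueError(f"expected {len(slots)} images, got {len(images)}")
    channels = images[0].shape[2]
    out = np.zeros((canvas, canvas, channels), dtype=images[0].dtype)
    boxes = []
    for img, (x, y, s) in zip(images, slots):
        out[y:y + s, x:x + s] = resize_nearest(img, s)
        boxes.append((x, y, x + s, y + s))
    return out, boxes


def region_embeddings(feature_map, boxes, eps=1e-12):
    """Average-pool each box of an HxWxC map and L2-normalise the result."""
    feats = np.stack(
        [feature_map[y0:y1, x0:x1].mean(axis=(0, 1)) for x0, y0, x1, y1 in boxes]
    )
    return feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)


def info_nce(queries, keys, temperature=0.2):
    """InfoNCE loss where row i of queries matches row i of keys."""
    logits = queries @ keys.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

In a real pipeline the feature map would come from a backbone applied to two augmented views of the montage, and the boxes would be rescaled to each feature level's stride. Here `region_embeddings` can be called on the raw canvas as a stand-in, which is enough to check shapes and the loss computation.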
