In this paper, we study self-supervised pretraining by examining whether self-supervised representation pretraining methods learn part-aware representations. The study is motivated by the observation that the random views used in contrastive learning and the random masked (visible) patches used in masked image modeling often correspond to object parts. We interpret contrastive learning as a part-to-whole task: the projection layer hallucinates the whole-object representation from the object-part representation learned by the encoder. We interpret masked image modeling as a part-to-part task: the masked patches of the object are hallucinated from the visible patches. This interpretation suggests that a self-supervised pretrained encoder must understand object parts. We empirically compare off-the-shelf encoders pretrained with several representative methods on object-level recognition and part-level recognition. The results show that the fully supervised model outperforms the self-supervised models on object-level recognition, whereas most self-supervised contrastive learning and masked image modeling methods outperform the fully supervised method on part-level recognition. We also observe that combining contrastive learning and masked image modeling further improves performance.
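The motivating observation, that a random view or a random set of visible patches usually covers only part of an object, can be illustrated with a minimal sketch. It is not the paper's code: the 14x14 patch grid, the 0.75 mask ratio, the minimum crop scale of 0.2, the object location, and the helper names `random_mask`, `random_crop`, and `object_coverage` are all assumptions chosen for illustration.

```python
import random


def random_mask(grid_size, mask_ratio, rng):
    """Return the set of visible (row, col) patch indices after random masking."""
    patches = [(r, c) for r in range(grid_size) for c in range(grid_size)]
    num_visible = len(patches) - int(round(mask_ratio * len(patches)))
    return set(rng.sample(patches, num_visible))


def random_crop(grid_size, min_scale, rng):
    """Sample a square crop (in patch units) whose area is roughly a fraction
    `scale` of the image, with `scale` drawn uniformly from [min_scale, 1]."""
    scale = rng.uniform(min_scale, 1.0)
    side = max(1, min(grid_size, round(grid_size * scale ** 0.5)))
    top = rng.randint(0, grid_size - side)
    left = rng.randint(0, grid_size - side)
    return {(r, c) for r in range(top, top + side) for c in range(left, left + side)}


def object_coverage(region, object_patches):
    """Fraction of the object's patches that fall inside the given region."""
    return len(region & object_patches) / len(object_patches)


rng = random.Random(0)
grid_size = 14  # assumed: e.g. a 224x224 image split into 16x16 patches

# Hypothetical object occupying an 8x8 block of patches near the image centre.
object_patches = {(r, c) for r in range(3, 11) for c in range(3, 11)}

visible = random_mask(grid_size, mask_ratio=0.75, rng=rng)  # assumed mask ratio
crop = random_crop(grid_size, min_scale=0.2, rng=rng)       # assumed crop scale

print("visible patches:", len(visible))
print("object coverage by visible patches:", object_coverage(visible, object_patches))
print("object coverage by random crop:", object_coverage(crop, object_patches))
```

A coverage value below 1 means the encoder sees only part of the object. Under the interpretation above, the remainder must then be hallucinated, either as the whole-object representation (contrastive learning) or as the masked patches (masked image modeling).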