In this paper, we study self-supervised pretraining by examining whether self-supervised representation pretraining methods learn part-aware representations. The study is mainly motivated by the observation that the random views used in contrastive learning and the random masked (visible) patches used in masked image modeling often correspond to object parts. We explain that contrastive learning is a part-to-whole task: the projection layer hallucinates the whole-object representation from the object-part representation learned by the encoder. Masked image modeling, in turn, is a part-to-part task: the masked patches of the object are hallucinated from the visible patches. This explanation suggests that a self-supervised pretrained encoder is required to understand object parts. We empirically compare off-the-shelf encoders pretrained with several representative methods on object-level and part-level recognition. The results show that the fully-supervised model outperforms self-supervised models on object-level recognition, whereas most self-supervised contrastive learning and masked image modeling methods outperform the fully-supervised method on part-level recognition. We also observe that combining contrastive learning and masked image modeling further improves performance.