In this paper, we introduce a novel self-supervised learning (SSL) loss for image representation learning. There is a growing belief that generalization in deep neural networks is linked to their ability to discriminate object shapes. Since the shape of an object is related to the location of its parts, we propose to detect parts that have been artificially misplaced. We represent object parts with image tokens and train a ViT to detect which token has been combined with an incorrect positional embedding. We then introduce sparsity in the inputs to make the model more robust to occlusions and to speed up training. We call our method DILEMMA, which stands for Detection of Incorrect Location EMbeddings with MAsked inputs. We apply DILEMMA to MoCoV3, DINO and SimCLR and show performance improvements of 4.41%, 3.97%, and 0.5%, respectively, under the same training time and with linear probing transfer on ImageNet-1K. We also show full fine-tuning improvements when MAE is combined with our method on ImageNet-100. We evaluate our method via fine-tuning on common SSL benchmarks. Moreover, we show that when downstream tasks rely strongly on shape (such as in the YOGA-82 pose dataset), our pre-trained features yield a significant gain over prior work.
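To make the pretext task concrete, the input preparation it implies can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `corrupt_positions`, the parameters `keep_ratio` and `flip_ratio`, their default values, and the uniform sampling of wrong positions are all assumptions that do not come from the abstract.

```python
import random


def corrupt_positions(num_tokens, keep_ratio=0.5, flip_ratio=0.2, rng=None):
    """Sample a sparse token subset and misplace some of its position indices.

    Hypothetical helper; ratios and sampling scheme are assumptions.
    Returns (kept, positions, labels):
      kept      - indices of the tokens that are fed to the model (sparse input)
      positions - position index paired with each kept token
      labels    - 1 where the paired position is incorrect, else 0
    """
    if num_tokens < 2:
        raise ValueError("need at least two tokens to misplace one")
    rng = rng or random.Random()

    # Sparsity: keep only a subset of the image tokens.
    num_keep = min(num_tokens, max(2, round(num_tokens * keep_ratio)))
    kept = sorted(rng.sample(range(num_tokens), num_keep))

    # Choose which of the kept tokens receive a wrong position.
    num_flip = max(1, round(num_keep * flip_ratio))
    flipped = set(rng.sample(range(num_keep), num_flip))

    positions, labels = [], []
    for slot, true_pos in enumerate(kept):
        if slot in flipped:
            # Draw uniformly from all positions except the true one.
            wrong = rng.randrange(num_tokens - 1)
            if wrong >= true_pos:
                wrong += 1
            positions.append(wrong)
            labels.append(1)
        else:
            positions.append(true_pos)
            labels.append(0)
    return kept, positions, labels


if __name__ == "__main__":
    # 196 tokens corresponds to a 14 x 14 patch grid (an assumed example size).
    kept, positions, labels = corrupt_positions(196, rng=random.Random(0))
    print(len(kept), sum(labels))
```

In such a setup, a ViT would receive the tokens at `kept`, each summed with the positional embedding at the matching entry of `positions`, and would be trained to predict `labels` per token, for instance with a binary cross-entropy loss (again an assumption about the loss form).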