Self-supervised learning (SSL) has recently achieved tremendous empirical advances in learning image representations. However, our understanding of the principles behind learning such representations is still limited. This work shows that joint-embedding SSL approaches primarily learn a representation of image patches that reflects their co-occurrence. This connection to co-occurrence modeling can be established formally, and it supplements the prevailing invariance perspective. We show empirically that learning a representation for fixed-scale patches and aggregating the local patch representations into an image representation achieves results similar to, or even better than, the baseline methods. We call this process BagSSL. Even with 32×32 patch representations, BagSSL achieves 62% top-1 linear probing accuracy on ImageNet. On the other hand, with a multi-scale pretrained model, we show that the whole-image embedding is approximately the average of the local patch embeddings. While the SSL representation is relatively invariant at the global scale, we show that locality is preserved when we zoom in to the local patch-level representation. Further, we show that patch representation aggregation can improve various state-of-the-art baseline methods by a large margin. The patch representation is considerably easier to understand, and this work takes a step toward demystifying self-supervised representation learning.
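The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the encoder here is a hypothetical stand-in (a fixed random linear map followed by L2 normalization), and the grid stride, embedding dimension, and plain averaging are assumptions chosen for illustration. It only shows the shape of the idea: embed fixed-scale patches independently, then average the patch embeddings to obtain an image embedding.

```python
import numpy as np


def extract_patches(image, patch_size=32, stride=16):
    """Slice an (H, W, C) image into fixed-scale square patches on a regular grid."""
    h, w, _ = image.shape
    patches = [
        image[top:top + patch_size, left:left + patch_size]
        for top in range(0, h - patch_size + 1, stride)
        for left in range(0, w - patch_size + 1, stride)
    ]
    return np.stack(patches)  # (num_patches, patch_size, patch_size, C)


def make_toy_encoder(patch_size=32, channels=3, dim=128, seed=0):
    """Hypothetical stand-in for a pretrained patch encoder (assumption, not BagSSL)."""
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((patch_size * patch_size * channels, dim))

    def encode(patches):
        flat = patches.reshape(len(patches), -1)  # flatten each patch to a vector
        z = flat @ weights                        # linear projection to `dim`
        return z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-norm rows

    return encode


def bag_of_patches_embedding(image, encode, patch_size=32, stride=16):
    """Embed every patch independently, then average into one image embedding."""
    patches = extract_patches(image, patch_size, stride)
    return encode(patches).mean(axis=0)


# Toy usage on a random 96x96 RGB "image".
image = np.random.default_rng(1).random((96, 96, 3))
encode = make_toy_encoder()
embedding = bag_of_patches_embedding(image, encode)
```

In practice `encode` would be replaced by a network pretrained on patches; the averaging line is the only part that corresponds directly to the aggregation described in the text.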