Self-supervised learning (SSL) learns representations by leveraging an auxiliary unsupervised task, such as classifying semantically related samples, e.g. different data augmentations or modalities. Of the many approaches to SSL, contrastive methods, e.g. SimCLR, CLIP and VICReg, have gained attention for learning representations that achieve downstream performance close to that of supervised learning. However, a theoretical understanding of the mechanism behind these methods remains elusive. We propose a generative latent variable model for the data and show that several families of discriminative self-supervised algorithms, including contrastive methods, approximately induce its latent structure over representations, providing a unifying theoretical framework. We also justify links to mutual information and the use of a projection head. Fitting our model generatively, as SimVAE, improves performance over previous VAE methods on common benchmarks (e.g. FashionMNIST, CIFAR10, CelebA), narrows the gap to discriminative methods on _content_ classification and, as our analysis predicts, outperforms them where _style_ information is required, taking a step toward task-agnostic representations.