Recent approaches in self-supervised learning of image representations can be categorized into different families of methods and, in particular, can be divided into contrastive and non-contrastive approaches. While differences between the two families have been thoroughly discussed to motivate new approaches, we focus more on the theoretical similarities between them. By designing contrastive and covariance-based non-contrastive criteria that can be related algebraically and shown to be equivalent under limited assumptions, we show how close those families can be. We further study popular methods and introduce variations of them, allowing us to relate this theoretical result to current practices and show the influence (or lack thereof) of design choices on downstream performance. Motivated by our equivalence result, we investigate the low performance of SimCLR and show how it can match VICReg's performance with careful hyperparameter tuning, improving significantly over known baselines. We also challenge the popular assumption that non-contrastive methods need large output dimensions. Our theoretical and quantitative results suggest that the numerical gaps between contrastive and non-contrastive methods in certain regimes can be closed given better network design choices and hyperparameter tuning. The evidence shows that unifying different SOTA methods is an important direction for building a better understanding of self-supervised learning.