Joint-embedding self-supervised learning (SSL), the key paradigm for unsupervised representation learning from visual data, learns from invariances between semantically related data pairs. We study the one-to-many mapping problem in SSL, where each datum may be mapped to multiple valid targets. This arises when data pairs come from naturally occurring generative processes, e.g., successive video frames. We show that existing methods struggle to flexibly capture this conditional uncertainty. As a remedy, we introduce a latent variable to account for this uncertainty and derive a variational lower bound on the mutual information between paired embeddings. Our derivation yields a simple regularization term for standard SSL objectives. The resulting method, which we call AdaSSL, applies to both contrastive and distillation-based SSL objectives, and we empirically show its versatility in causal representation learning, fine-grained image understanding, and world modeling on videos.
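
As a rough illustration of how a latent-variable regularizer can attach to a standard contrastive objective, the following minimal sketch combines an InfoNCE loss with a KL penalty on a diagonal-Gaussian posterior over the latent. It is not AdaSSL's actual objective, which the abstract does not specify: the additive `predict` function, the standard-normal prior, the weight `beta`, and all function names are assumptions made purely for illustration.

```python
import math


def info_nce(anchors, targets, temperature=0.1):
    """Mean InfoNCE loss over a batch; anchors[i] is paired with targets[i]."""

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one sample."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def predict(z_x, r):
    """Hypothetical latent-conditioned predictor: shift the embedding by r."""
    return [a + b for a, b in zip(z_x, r)]


def regularized_loss(z_x, z_y, latents, mus, log_vars, beta=0.01):
    """InfoNCE on latent-conditioned predictions plus a weighted KL term.

    Assumed form only: a generic variational-style regularizer, not the
    term derived in the paper.
    """
    predictions = [predict(zx, r) for zx, r in zip(z_x, latents)]
    kl = sum(gaussian_kl(m, lv) for m, lv in zip(mus, log_vars)) / len(mus)
    return info_nce(predictions, z_y) + beta * kl
```

In this sketch the KL term limits how much information the latent can carry about the target, so the latent can absorb the conditional uncertainty of a one-to-many pairing without simply copying the target embedding. In practice the latents would be sampled from the posterior and the whole objective trained with automatic differentiation.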