Reconstruction and joint embedding have emerged as the two leading paradigms in self-supervised learning (SSL). Reconstruction methods recover the original sample from a different view in input space, whereas joint-embedding methods align the representations of different views in latent space. Both approaches offer compelling advantages, yet practitioners lack clear guidelines for choosing between them. In this work, we unveil the core mechanisms that distinguish each paradigm. By leveraging closed-form solutions for both approaches, we precisely characterize how the view-generation process, e.g., data augmentation, impacts the learned representations. We then demonstrate that, unlike supervised learning, both SSL paradigms require a minimal alignment between augmentations and irrelevant features to achieve asymptotic optimality with increasing sample size. Our findings indicate that in scenarios where these irrelevant features have a large magnitude, joint-embedding methods are preferable because they impose a strictly weaker alignment condition than reconstruction-based methods. These results not only clarify the trade-offs between the two paradigms but also substantiate the empirical success of joint-embedding approaches on challenging real-world datasets.
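To make the contrast between the two paradigms concrete, the following is a minimal PyTorch sketch of their training objectives. The encoder `f`, decoder `g`, additive-noise augmentation, and the plain invariance term for joint embedding are illustrative assumptions for a toy linear setting (echoing the closed-form setup the abstract refers to), not the paper's exact formulation.

```python
# Minimal sketch contrasting reconstruction vs. joint-embedding SSL objectives.
# All names (f, g, aug) and the linear setting are illustrative assumptions.
import torch
import torch.nn.functional as F

def reconstruction_loss(f, g, x_view, x_target):
    """Reconstruction SSL: recover the original sample in input space
    from an augmented view, via encoder f and decoder g."""
    return F.mse_loss(g(f(x_view)), x_target)

def joint_embedding_loss(f, x_view1, x_view2):
    """Joint-embedding SSL: align the latent representations of two views
    of the same sample. Only the invariance term is shown; collapse-prevention
    mechanisms (contrastive negatives, variance regularization, etc.) are omitted."""
    return F.mse_loss(f(x_view1), f(x_view2))

# Toy usage with linear encoder/decoder and an additive-noise augmentation.
d, k, n = 16, 4, 128
f = torch.nn.Linear(d, k)   # encoder: input space -> latent space
g = torch.nn.Linear(k, d)   # decoder: latent space -> input space
x = torch.randn(n, d)
aug = lambda t: t + 0.1 * torch.randn_like(t)  # toy view generation

loss_rec = reconstruction_loss(f, g, aug(x), x)        # match in input space
loss_je = joint_embedding_loss(f, aug(x), aug(x))      # match in latent space
```

The key structural difference the sketch exposes is where the loss is measured: reconstruction penalizes errors on every input coordinate, including high-magnitude irrelevant features, while joint embedding only penalizes misalignment in the learned latent space.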