The mechanisms behind the success of multi-view self-supervised learning (MVSSL) are not yet fully understood. Contrastive MVSSL methods have been studied through the lens of InfoNCE, a lower bound on the mutual information (MI). However, the relation between other MVSSL methods and MI remains unclear. We consider a different lower bound on the MI, consisting of an entropy term and a reconstruction term (ER), and analyze the main MVSSL families through its lens. Using this ER bound, we show that clustering-based methods such as DeepCluster and SwAV maximize the MI. We also reinterpret the mechanisms of distillation-based approaches such as BYOL and DINO, showing that they explicitly maximize the reconstruction term and implicitly encourage a stable entropy, and we confirm this empirically. We show that replacing the objectives of common MVSSL methods with this ER bound achieves competitive performance, while making training stable with smaller batch sizes or smaller exponential moving average (EMA) coefficients. GitHub repository: https://github.com/apple/ml-entropy-reconstruction.
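To make the entropy-plus-reconstruction idea concrete, the following is a minimal sketch on a toy discrete example. It assumes the standard variational form of such a bound, I(Z1; Z2) >= H(Z2) + E[log q(Z2 | Z1)], where q is any reconstruction model. The notation, the 2x2 joint table, and the function names are illustrative assumptions, not taken from the paper or its repository.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mutual_information(joint):
    """Exact I(Z1; Z2) for a joint table with rows = z1, columns = z2."""
    p1 = joint.sum(axis=1)
    p2 = joint.sum(axis=0)
    return entropy(p1) + entropy(p2) - entropy(joint.ravel())

def er_bound(joint, q):
    """Entropy term H(Z2) plus reconstruction term E_p[log q(z2 | z1)].

    q[i, j] is a hypothetical reconstruction model q(z2=j | z1=i);
    each row of q must sum to 1 and be positive wherever joint is positive.
    """
    p2 = joint.sum(axis=0)
    mask = joint > 0
    reconstruction = float((joint[mask] * np.log(q[mask])).sum())
    return entropy(p2) + reconstruction

# Toy joint distribution over two binary "views" (made up for illustration).
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])

# The bound is tight when q equals the true conditional p(z2 | z1) ...
true_conditional = joint / joint.sum(axis=1, keepdims=True)
# ... and loose for any other reconstruction model.
rough_q = np.array([[0.6, 0.4],
                    [0.3, 0.7]])

mi = mutual_information(joint)
tight = er_bound(joint, true_conditional)
loose = er_bound(joint, rough_q)
print(f"MI={mi:.4f}  ER(true conditional)={tight:.4f}  ER(rough q)={loose:.4f}")
```

In this sketch, the gap between the MI and the bound is the expected KL divergence between the true conditional and q, which is why a better reconstruction model tightens the bound while the entropy term is unaffected by the choice of q.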