Addressing Sample Inefficiency in Multi-View Representation Learning

Non-contrastive self-supervised learning (NC-SSL) methods like BarlowTwins and VICReg have shown great promise for label-free representation learning in computer vision. Despite the apparent simplicity of these techniques, researchers must rely on several empirical heuristics to achieve competitive performance, most notably high-dimensional projector heads and two augmentations of the same image. In this work, we provide theoretical insights into the implicit bias of the BarlowTwins and VICReg losses that can explain these heuristics and guide the development of more principled recommendations. Our first insight is that the orthogonality of the features is more critical than projector dimensionality for learning good representations. Based on this, we empirically demonstrate that low-dimensional projector heads are sufficient with appropriate regularization, contrary to the existing heuristic. Our second theoretical insight suggests that using multiple data augmentations better represents the desiderata of the SSL objective. Based on this, we demonstrate that leveraging more augmentations per sample improves representation quality and trainability. In particular, it improves optimization convergence, so that better features emerge earlier in training. Remarkably, we demonstrate that we can reduce the pretraining dataset size by up to 4x while maintaining accuracy and improving convergence simply by using more data augmentations. Combining these insights, we present practical pretraining recommendations that reduce wall-clock time by 2x and improve performance on the CIFAR-10 and STL-10 datasets using a ResNet-50 backbone. Thus, this work provides theoretical insight into NC-SSL and produces practical recommendations for enhancing its sample and compute efficiency.
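
To make the objective discussed above concrete, the following is a minimal sketch of a BarlowTwins-style cross-correlation loss and one possible way to extend it to more than two augmentations. It is an illustration under stated assumptions, not the estimator used in this work: the function names (`standardize`, `cross_correlation`, `barlow_twins_loss`, `multi_view_loss`), the off-diagonal weight of 0.005, and the choice to average the loss over all pairs of views are all hypothetical. Embeddings are plain nested lists of shape (batch, features) so that the sketch runs with the Python standard library alone.

```python
import math
from itertools import combinations


def standardize(z):
    """Give each feature column zero mean and unit variance over the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        std = math.sqrt(var) + 1e-12  # small constant avoids division by zero
        for i in range(n):
            out[i][j] = (z[i][j] - mean) / std
    return out


def cross_correlation(za, zb):
    """Return the d x d cross-correlation matrix between two batches of embeddings."""
    n, d = len(za), len(za[0])
    za, zb = standardize(za), standardize(zb)
    return [
        [sum(za[k][i] * zb[k][j] for k in range(n)) / n for j in range(d)]
        for i in range(d)
    ]


def barlow_twins_loss(za, zb, off_diag_weight=0.005):
    """Invariance term (diagonal pushed to 1) plus a weighted
    redundancy-reduction term (off-diagonal pushed to 0).
    The default weight is an illustrative assumption."""
    c = cross_correlation(za, zb)
    d = len(c)
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + off_diag_weight * off_diag


def multi_view_loss(views, off_diag_weight=0.005):
    """Assumed multi-augmentation variant: average the two-view loss
    over every unordered pair of augmented views."""
    pairs = list(combinations(views, 2))
    return sum(barlow_twins_loss(a, b, off_diag_weight) for a, b in pairs) / len(pairs)
```

In this sketch, the off-diagonal term is what penalizes correlated (non-orthogonal) features, and `multi_view_loss` accepts any number of views of at least two, so the usual two-augmentation setup is the special case `multi_view_loss([za, zb])`.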
