Self-supervised learning (SSL) methods based on Siamese networks learn visual representations by aligning different views of the same image. The multi-crop strategy, which adds small local crops alongside global ones, enhances many SSL frameworks but causes instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3. We trace this failure to the predictor shared across all views and demonstrate that assigning a separate predictor to each view type stabilizes multi-crop training, resulting in significant performance gains. Extending this idea, we treat each spatial transformation as a distinct alignment task and add cutout views, in which part of the image is masked before encoding. This yields a simple multi-task formulation of asymmetric Siamese SSL that combines global, local, and masked views in a single framework. The approach is stable, generally applicable across backbones, and consistently improves the performance of ResNet and ViT models on ImageNet.
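The core architectural change described above can be sketched as follows. This is an illustrative toy, not the authors' implementation: a tiny linear map stands in for the usual two-layer MLP predictor, and the view-type names (`global`, `local`, `cutout`) and dimension are assumptions for the sketch. The point is only the routing: each view type gets its own predictor parameters instead of sharing one head.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hypothetical embedding dimension for the sketch

def make_predictor(dim, rng):
    """A tiny linear predictor standing in for the usual 2-layer MLP head."""
    return {"W": rng.normal(0.0, 0.1, (dim, dim)), "b": np.zeros(dim)}

# One independent predictor per view type, as opposed to a single shared one.
predictors = {v: make_predictor(DIM, rng) for v in ("global", "local", "cutout")}

def predict(z, view_type):
    """Route the online-branch embedding z through its view type's predictor."""
    p = predictors[view_type]
    return z @ p["W"] + p["b"]

# Each view's embedding is aligned with the target branch through its own
# predictor, so the alignment tasks for global, local, and cutout views no
# longer compete for the parameters of one shared head.
z_local = rng.normal(size=DIM)
out = predict(z_local, "local")
```

In a BYOL- or SimSiam-style training loop, `predict` would be applied only on the online branch before the similarity loss against the stop-gradient target, with the dictionary lookup selecting the head that matches the view that produced `z`.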