Most self-supervised methods for representation learning rely on a cross-view consistency objective, i.e., they maximize the similarity between the representations of augmented views of the same image. The recent NNCLR method goes beyond the cross-view paradigm and, in a contrastive setting, uses positive pairs drawn from different images via nearest-neighbor bootstrapping. We show empirically that, unlike the contrastive setting, which relies on negative samples, incorporating nearest-neighbor bootstrapping into a self-distillation scheme can lead to a performance drop or even collapse. We investigate the cause of this unexpected behavior and propose a solution: adaptively bootstrapping neighbors based on the estimated quality of the latent space. We report consistent improvements over both the naive bootstrapping approach and the original baselines. The approach improves performance across various combinations of self-distillation methods and backbones and on standard downstream tasks. Our code is publicly available at https://github.com/tileb1/AdaSim.
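To make the idea of adaptive neighbor bootstrapping concrete, the following is a minimal NumPy sketch and not the implementation in the linked repository. It assumes a hypothetical quality proxy (the fraction of views whose nearest entry in a support set belongs to their own image) and uses it as the probability of swapping the usual cross-view target for the nearest neighbor from a different image. The function names, the support-set layout, and the proxy itself are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def adaptive_bootstrap(view_emb, support_emb, rng):
    """Pick, for each sample, the index of its target in the support set.

    view_emb[i]    : embedding of one augmented view of image i
    support_emb[i] : stored embedding of another view of image i
    Returns (targets, quality).
    """
    n = view_emb.shape[0]
    own = np.arange(n)
    # Cosine similarity between every view and every support entry.
    sims = l2_normalize(view_emb) @ l2_normalize(support_emb).T

    # Assumed quality proxy (hypothetical): fraction of views whose
    # nearest support entry is the one from their own image.
    quality = float((sims.argmax(axis=1) == own).mean())

    # Nearest neighbor from a *different* image: mask the own-image entry.
    masked = sims.copy()
    masked[own, own] = -np.inf
    neighbor = masked.argmax(axis=1)

    # Bootstrap the neighbor with probability `quality`; otherwise keep
    # the standard cross-view target (the sample's own support entry).
    use_neighbor = rng.random(n) < quality
    targets = np.where(use_neighbor, neighbor, own)
    return targets, quality
```

Under this assumption, a poorly structured latent space (low quality) falls back to plain cross-view targets, while a well-structured one bootstraps neighbors more often, which is one way to avoid reinforcing unreliable neighbors early in training.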