An effective technique for obtaining high-quality representations is to add a projection head on top of the encoder during training, then discard it and use the pre-projection representations. Despite its proven practical effectiveness, the reason behind the success of this technique is poorly understood. The pre-projection representations are not directly optimized by the loss function, which raises the question: what makes them better? In this work, we provide a rigorous theoretical answer to this question. We start by examining linear models trained with a self-supervised contrastive loss. We reveal that the implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal in deeper layers. Consequently, lower layers tend to have more normalized and less specialized representations. We theoretically characterize scenarios where such representations are more beneficial, highlighting the intricate interplay between data augmentation and input features. Additionally, we demonstrate that introducing non-linearity into the network allows lower layers to learn features that are completely absent in higher layers. Finally, we show how this mechanism improves robustness in supervised contrastive learning and supervised learning. We empirically validate our results through experiments on CIFAR-10/100, UrbanCars, and shifted versions of ImageNet. We also introduce a potential alternative to the projection head, which offers a more interpretable and controllable design.
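For readers unfamiliar with the setup, the following is a minimal sketch of the train-with-head, discard-at-use pattern described above. It is not the paper's implementation: the linear encoder and head, the names `W_enc`, `W_head`, `encode`, `project`, and `info_nce`, the additive-noise augmentation, and all dimensions are assumptions chosen for illustration, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(d_in, d_rep, d_proj):
    """Hypothetical linear encoder W_enc and linear projection head W_head."""
    return {
        "W_enc": rng.normal(scale=0.1, size=(d_in, d_rep)),
        "W_head": rng.normal(scale=0.1, size=(d_rep, d_proj)),
    }

def encode(params, x):
    # Pre-projection representation: this is what is kept after training.
    return x @ params["W_enc"]

def project(params, h):
    # Post-projection embedding: only used inside the training loss.
    return h @ params["W_head"]

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE-style contrastive loss between two augmented views.

    Row i of z1 and row i of z2 are treated as the positive pair;
    all other rows of z2 act as negatives for row i of z1.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

params = init_params(d_in=8, d_rep=4, d_proj=2)
x = rng.normal(size=(16, 8))
view1 = x + 0.1 * rng.normal(size=x.shape)  # toy augmentation (assumption)
view2 = x + 0.1 * rng.normal(size=x.shape)

# Training-time path: encoder -> projection head -> contrastive loss.
loss = info_nce(project(params, encode(params, view1)),
                project(params, encode(params, view2)))

# Downstream path: the head is discarded; only encoder outputs are used.
features = encode(params, x)
print(loss, features.shape)
```

The point of the sketch is only that the loss sees the output of `project`, while downstream tasks consume the output of `encode`, so the retained representation is never directly optimized.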