One unexpected technique that has emerged in recent years consists of training a Deep Network (DN) with a Self-Supervised Learning (SSL) method and then using this network on downstream tasks with its last few projector layers entirely removed. This trick of throwing away the projector is critical for SSL methods to achieve competitive performance on ImageNet, where it can yield a gain of more than 30 percentage points. This is somewhat vexing, as one would hope that the layer at which invariance is explicitly enforced by the SSL criterion during training (the last projector layer) would be the one to use for the best downstream generalization performance. It seems not to be, and this study sheds some light on why. This trick, which we name Guillotine Regularization (GR), is in fact a generically applicable method that has been used to improve generalization performance in transfer learning scenarios. In this work, we identify the underlying reasons behind its success and show that the optimal layer to use may change significantly depending on the training setup, the data, or the downstream task. Lastly, we give some insights on how to reduce the need for a projector in SSL by aligning the pretext SSL task with the downstream task.
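To make the mechanics of the trick concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes a toy network built from small random ReLU layers; the helper names (`make_linear`, `forward`, `guillotine`), the layer sizes, and the backbone/projector split are all hypothetical and chosen only for illustration. The SSL criterion would be applied to the output of the full network, while downstream tasks read the representation obtained after cutting off the last layers.

```python
import random


def make_linear(n_in, n_out, rng):
    """Return a toy layer computing ReLU(W x) with random weights (illustrative only)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

    def layer(x):
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return layer


def forward(layers, x):
    """Apply the layers in order to the input vector x."""
    for layer in layers:
        x = layer(x)
    return x


def guillotine(layers, n_removed):
    """Return the network with its last n_removed layers cut off."""
    if not 0 <= n_removed < len(layers):
        raise ValueError("n_removed must leave at least one layer")
    return layers[: len(layers) - n_removed]


rng = random.Random(0)

# Hypothetical architecture: a 2-layer backbone followed by a 2-layer projector.
backbone = [make_linear(8, 16, rng), make_linear(16, 16, rng)]
projector = [make_linear(16, 32, rng), make_linear(32, 4, rng)]
full_network = backbone + projector

x = [rng.uniform(-1.0, 1.0) for _ in range(8)]

# During SSL training, the criterion is applied to the projector output.
ssl_embedding = forward(full_network, x)

# For downstream tasks, the projector is removed and earlier features are used.
features = forward(guillotine(full_network, len(projector)), x)
```

Here `n_removed` is left as a parameter rather than fixed to the projector depth, since the best layer to cut at may depend on the training setup, the data, or the downstream task.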