This paper proposes the DistillCSE framework, which performs contrastive learning under the self-training paradigm with knowledge distillation. The potential advantage of DistillCSE is its self-enhancing feature: by using a base model to provide additional supervision signals, a stronger model may be learned through knowledge distillation. However, vanilla DistillCSE, which uses the standard implementation of knowledge distillation, achieves only marginal improvements because of severe overfitting. Further quantitative analysis traces this overfitting to the relatively large variance of the teacher model's logits, which arises from the nature of contrastive learning. To mitigate the issues induced by this high variance, the paper proposes two simple yet effective solutions for knowledge distillation: a Group-P shuffling strategy that acts as an implicit regularizer, and averaging the logits from multiple teacher components. Experiments on standard benchmarks demonstrate that the proposed DistillCSE outperforms many strong baseline methods and yields new state-of-the-art performance.
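As a rough illustration of the second remedy, the sketch below averages the similarity logits of several teachers before distilling them into a student. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`softmax`, `average_teacher_logits`, `kd_loss`), the temperature parameter, and the choice of KL divergence as the distillation loss are all hypothetical, and the toy logits are made up for demonstration. Group-P shuffling is not shown, because the abstract does not specify its details.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn a list of logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_teacher_logits(teacher_logits):
    """Element-wise mean over several teachers' logits.

    teacher_logits: list of equal-length lists, one per teacher.
    Averaging independent estimates is a standard way to reduce variance.
    """
    count = len(teacher_logits)
    return [sum(column) / count for column in zip(*teacher_logits)]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) between the softmax distributions.

    Assumption: KL divergence is used here as a common distillation
    loss; the abstract does not state the exact objective.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy example: similarity scores of one anchor sentence against three
# in-batch candidates, as scored by two hypothetical teachers.
teachers = [[0.9, 0.1, 0.5], [0.7, 0.3, 0.5]]
student = [0.6, 0.4, 0.5]

averaged = average_teacher_logits(teachers)
loss = kd_loss(student, averaged, temperature=0.05)
```

The loss is zero only when the student reproduces the averaged teacher distribution, so minimizing it pulls the student toward the smoothed target rather than toward any single teacher's noisier logits.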