Representation learning has evolved from traditional supervised training to Contrastive Learning (CL) and Masked Image Modeling (MIM). Previous work has shown that each has strengths and weaknesses in specific scenarios: CL and supervised pre-training excel at capturing longer-range global patterns and yield better feature discrimination, while MIM introduces more local and diverse attention across all transformer layers. In this paper, we explore how to obtain a model that combines these strengths. We start by examining previous feature distillation and masked feature reconstruction methods and identify their limitations. We find that the increased diversity they provide derives mainly from their asymmetric designs, but these designs may in turn compromise discrimination ability. To obtain both discrimination and diversity, we propose a simple but effective hybrid distillation strategy, Hybrid Distill, which uses both a supervised/CL teacher and an MIM teacher to jointly guide the student model. Hybrid Distill imitates the token relations of the MIM teacher to alleviate attention collapse and distills the feature maps of the supervised/CL teacher to enable discrimination. Furthermore, a progressive redundant token masking strategy is used to reduce distillation costs and avoid falling into local optima. Experimental results show that Hybrid Distill achieves superior performance on different benchmarks.
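The two-teacher objective described above can be illustrated with a minimal sketch. The abstract does not give the exact loss formulation, so everything here is an assumption made for illustration: token relations are taken to be a row-wise softmax over scaled dot products between tokens, the relation term is a KL divergence against the MIM teacher, the feature term is a mean squared error against the supervised/CL teacher, and the weights `alpha` and `beta` and all function names are hypothetical. The progressive redundant token masking step is omitted because the abstract does not describe how tokens are selected.

```python
import math

EPS = 1e-12  # guards against log(0) and division by zero


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_relations(tokens):
    """Assumed relation form: row-wise softmax of scaled token dot products."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(t_i, t_j)) for t_j in tokens])
        for t_i in tokens
    ]


def relation_loss(student_tokens, mim_teacher_tokens):
    """Mean KL(teacher || student) over rows of the relation matrices (assumed)."""
    s_rel = token_relations(student_tokens)
    t_rel = token_relations(mim_teacher_tokens)
    kl = 0.0
    for s_row, t_row in zip(s_rel, t_rel):
        kl += sum(t * math.log((t + EPS) / (s + EPS)) for s, t in zip(s_row, t_row))
    return kl / len(s_rel)


def feature_loss(student_tokens, cl_teacher_tokens):
    """Mean squared error between student and teacher feature maps (assumed)."""
    squared = [
        (a - b) ** 2
        for s_tok, t_tok in zip(student_tokens, cl_teacher_tokens)
        for a, b in zip(s_tok, t_tok)
    ]
    return sum(squared) / len(squared)


def hybrid_distill_loss(student, mim_teacher, cl_teacher, alpha=1.0, beta=1.0):
    """Weighted sum of the two terms; alpha and beta are hypothetical weights."""
    return alpha * feature_loss(student, cl_teacher) + beta * relation_loss(
        student, mim_teacher
    )
```

Each argument is a list of token vectors (one list of floats per token) with matching shapes across student and teachers. In practice the same idea would be written with a tensor library and applied to transformer outputs; plain Python is used here only to keep the sketch self-contained.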