In this paper, we tackle a new problem: how can knowledge be transferred from a pre-trained, cumbersome yet well-performing CNN-based model to a compact Vision Transformer (ViT)-based model while maintaining its learning capacity? Because ViTs and CNNs have completely different characteristics, and because of the long-standing capacity gap between teacher and student models in Knowledge Distillation (KD), directly transferring knowledge across model types is non-trivial. To this end, we leverage the visual and linguistic-compatible feature characteristics of the ViT (i.e., the student) and its capacity gap with the CNN (i.e., the teacher), and propose a novel CNN-to-ViT KD framework, dubbed C2VKD. Importantly, as the teacher's features are heterogeneous to those of the student, we first propose a novel visual-linguistic feature distillation (VLFD) module that explores efficient KD among the aligned visual and linguistic-compatible representations. Moreover, due to the large capacity gap between the teacher and student and the inevitable prediction errors of the teacher, we then propose a pixel-wise decoupled distillation (PDD) module that supervises the student with a combination of labels and the teacher's predictions, decoupled into target and non-target classes. Experiments on three semantic segmentation benchmark datasets consistently show that the mIoU gain of our method is more than 200% of that of the SoTA KD methods.
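To make the idea behind pixel-wise decoupled distillation concrete, the following is a minimal sketch of one plausible reading of it, not the formulation used in C2VKD. It assumes that the ground-truth label selects the target class at each pixel, that the teacher's softened prediction is split into a binary target/non-target term and a term over the non-target classes only, and that both are added to a standard cross-entropy loss on the labels. The function names and the weights `alpha`, `beta`, and `temperature` are hypothetical and introduced purely for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of class logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(logits, label):
    """Cross-entropy of one pixel against its label, via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def decoupled_pixel_loss(student, teacher, label,
                         alpha=1.0, beta=1.0, temperature=1.0):
    """Loss for a single pixel (assumed form, see the lead-in).

    student, teacher: class logits for this pixel.
    label: ground-truth class index, which defines the target class.
    """
    p_s = softmax(student, temperature)
    p_t = softmax(teacher, temperature)

    # Target term: binary (target vs. all other classes) distributions.
    target_term = kl_divergence([p_t[label], 1.0 - p_t[label]],
                                [p_s[label], 1.0 - p_s[label]])

    # Non-target term: distributions restricted to the non-target classes.
    # A softmax over the remaining logits equals renormalizing the full
    # softmax without the target class, but is numerically safer.
    rest_s = [z for k, z in enumerate(student) if k != label]
    rest_t = [z for k, z in enumerate(teacher) if k != label]
    non_target_term = kl_divergence(softmax(rest_t, temperature),
                                    softmax(rest_s, temperature))

    return (cross_entropy(student, label)
            + alpha * target_term
            + beta * non_target_term)


def pixel_wise_decoupled_loss(student_logits, teacher_logits, labels,
                              alpha=1.0, beta=1.0, temperature=1.0):
    """Average the per-pixel loss over a flattened list of pixels."""
    losses = [decoupled_pixel_loss(s, t, y, alpha, beta, temperature)
              for s, t, y in zip(student_logits, teacher_logits, labels)]
    return sum(losses) / len(losses) if losses else 0.0
```

Under these assumptions, setting `beta` independently of `alpha` allows the non-target knowledge from the teacher to be weighted separately from its target-class confidence, which is the usual motivation for decoupling. When the student and teacher logits agree, both distillation terms vanish and the loss reduces to the cross-entropy on the labels.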