Large vision Transformers (ViTs) driven by self-supervised pre-training mechanisms have achieved unprecedented progress. Lightweight ViT models, however, are limited by their model capacity and benefit little from those pre-training mechanisms. Knowledge distillation defines a paradigm for transferring representations from large (teacher) models to small (student) ones. However, conventional single-stage distillation easily gets stuck on task-specific transfer and fails to retain the task-agnostic knowledge that is crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders. In generic distillation, the decoder of the small model is encouraged to align its feature predictions with the hidden representations of the large model, so that task-agnostic knowledge can be transferred. In specific distillation, the predictions of the small model are constrained to be consistent with those of the large model, so that the task-specific features which guarantee task performance are transferred. With G2SD, the vanilla ViT-Small model achieves 98.7%, 98.1%, and 99.3% of the performance of its teacher (ViT-Base) on image classification, object detection, and semantic segmentation, respectively, setting a solid baseline for two-stage vision distillation. Code will be available at https://github.com/pengzhiliang/G2SD.
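The two distillation stages can be illustrated with a minimal sketch. The abstract does not state the exact loss functions, so the choices here are assumptions: mean squared error for the generic stage (student decoder feature predictions against teacher hidden representations) and a temperature-scaled KL divergence for the specific stage (student predictions against teacher predictions). The function names `generic_distillation_loss` and `specific_distillation_loss` are hypothetical and are not taken from the G2SD code.

```python
import math


def generic_distillation_loss(student_feats, teacher_feats):
    """Stage 1 (assumed form): mean squared error between the student
    decoder's feature predictions and the teacher's hidden representations.
    Both arguments are lists of equally sized feature vectors (one per token)."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_feats, teacher_feats):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def specific_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Stage 2 (assumed form): KL(teacher || student) between the two
    models' softened class distributions on the downstream task."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy values only; these are not results from the paper.
teacher_hidden = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
student_pred = [[0.1, -0.1, 0.5], [0.1, 0.2, -0.2]]
print("generic loss:", generic_distillation_loss(student_pred, teacher_hidden))
print("specific loss:", specific_distillation_loss([1.0, 0.5, -0.3], [1.2, 0.4, -0.5]))
```

In this sketch the two losses would be applied in sequence: the generic loss first, to transfer task-agnostic representations, and the specific loss afterwards on the downstream task. Both losses reach zero when the student reproduces the teacher exactly.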