Distilling knowledge from convolutional neural networks (CNNs) is a double-edged sword for vision transformers (ViTs). It boosts performance, because the CNN's image-friendly local inductive bias helps the ViT learn faster and better, but it also causes two problems: (1) The network designs of CNNs and ViTs are completely different, so their intermediate features sit at different semantic levels, which makes spatial-wise knowledge transfer methods (e.g., feature mimicking) inefficient. (2) Distilling knowledge from a CNN limits network convergence in the later training period, because the ViT's capability of integrating global information is suppressed by the CNN's local-inductive-bias supervision. To address these problems, we present Cumulative Spatial Knowledge Distillation (CSKD). CSKD distills spatial-wise knowledge to all patch tokens of the ViT from the corresponding spatial responses of the CNN, without introducing intermediate features. Furthermore, CSKD employs a Cumulative Knowledge Fusion (CKF) module, which introduces the global response of the CNN and increasingly emphasizes its importance as training proceeds. Applying CKF leverages the CNN's local inductive bias in the early training period and fully exploits the ViT's global capability in the later one. Extensive experiments and analysis on ImageNet-1k and downstream datasets demonstrate the superiority of CSKD. Code will be publicly available.
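The two ideas above, per-patch distillation from the CNN's spatial responses and a fusion target that shifts toward the CNN's global response over training, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a linear schedule for the fusion weight, a KL-divergence loss on softmax probabilities, and fusion at the probability level; the abstract specifies none of these, and all function names (`ckf_weight`, `fuse_targets`, `cskd_loss`) are hypothetical. Plain Python is used for readability; a real training loop would use a tensor library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ckf_weight(step, total_steps):
    """Weight on the CNN's global response.

    ASSUMPTION: grows linearly from 0 (first step) to 1 (last step).
    The abstract only says the global response is emphasized increasingly.
    """
    if total_steps <= 1:
        return 1.0
    return min(max(step / (total_steps - 1), 0.0), 1.0)


def fuse_targets(local_probs, global_probs, alpha):
    """Blend one patch's local CNN response with the CNN's global response."""
    return [(1.0 - alpha) * l + alpha * g
            for l, g in zip(local_probs, global_probs)]


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, pred))


def cskd_loss(vit_patch_logits, cnn_local_logits, cnn_global_logits,
              step, total_steps):
    """Mean per-patch KL between fused CNN targets and ViT patch predictions.

    vit_patch_logits: one list of class logits per ViT patch token.
    cnn_local_logits: class logits at the matching CNN spatial position.
    cnn_global_logits: the CNN's image-level class logits.
    ASSUMPTION: KL on probabilities; the loss form is not given in the abstract.
    """
    alpha = ckf_weight(step, total_steps)
    global_probs = softmax(cnn_global_logits)
    losses = []
    for vit_logits, cnn_logits in zip(vit_patch_logits, cnn_local_logits):
        target = fuse_targets(softmax(cnn_logits), global_probs, alpha)
        losses.append(kl_divergence(target, softmax(vit_logits)))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy example: 2 patches, 3 classes (made-up numbers).
    vit = [[1.0, 0.5, -0.2], [0.1, 0.9, 0.3]]
    cnn_local = [[2.0, 0.1, -1.0], [-0.5, 1.5, 0.2]]
    cnn_global = [1.2, 1.0, -0.4]
    for step in (0, 50, 99):
        print(step, cskd_loss(vit, cnn_local, cnn_global, step, 100))
```

Early in training the target for each patch token is dominated by the CNN's local response at that position; by the last step the same global response supervises every patch token, so the local supervision no longer constrains the ViT in the later period.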