Vision Transformers (ViTs) have achieved significant advances in computer vision tasks due to their powerful modeling capacity. However, their performance degrades notably when they are trained with insufficient data, owing to their lack of inherent inductive biases. Distilling knowledge and inductive biases from a Convolutional Neural Network (CNN) teacher has emerged as an effective strategy for enhancing the generalization of ViTs on limited datasets. Previous approaches to Knowledge Distillation (KD) have pursued two primary paths: some focused solely on distilling the logit distribution from the CNN teacher to the ViT student, neglecting the rich semantic information in intermediate features because of the structural differences between the two architectures. Others integrated feature distillation alongside logit distillation, but this introduced alignment operations that limit the amount of knowledge transferred due to mismatched architectures and increase the computational overhead. To address these issues, this paper presents the Hybrid Data-efficient Knowledge Distillation (HDKD) paradigm, which employs a CNN teacher and a hybrid student. The choice of a hybrid student serves two main purposes. First, it leverages the strengths of both convolutions and transformers while sharing its convolutional structure with the teacher model. Second, this shared structure enables the direct application of feature distillation without any information loss or additional computational overhead. Additionally, we propose an efficient lightweight convolutional block named Mobile Channel-Spatial Attention (MBCSA), which serves as the primary convolutional block in both the teacher and student models. Extensive experiments on two public medical datasets demonstrate the superiority of HDKD over other state-of-the-art models as well as its computational efficiency. Source code: https://github.com/omarsherif200/HDKD
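The hybrid objective described above, logit distillation combined with direct feature distillation made possible by the shared convolutional structure, can be sketched as follows. This is a minimal illustration using NumPy; the loss weights `alpha`, `beta` and the temperature `T` are illustrative assumptions, not the paper's actual hyperparameters or implementation.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def hybrid_distillation_loss(student_logits, teacher_logits,
                             student_feat, teacher_feat,
                             T=4.0, alpha=0.5, beta=0.1):
    """Illustrative combined KD loss (hypothetical weights, not the paper's).

    - Logit term: KL divergence between temperature-softened teacher and
      student distributions, scaled by T^2 as is standard in KD.
    - Feature term: plain MSE between intermediate feature maps; no alignment
      projection is needed when teacher and student share the same
      convolutional structure, so the shapes already match.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = float(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))) * T * T
    mse = float(np.mean((np.asarray(student_feat, dtype=float)
                         - np.asarray(teacher_feat, dtype=float)) ** 2))
    return alpha * kl + beta * mse
```

When the student exactly matches the teacher in both logits and features, both terms vanish; any discrepancy in either the output distribution or the shared-stage features increases the loss.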