Few-shot knowledge distillation has recently emerged as a viable approach to harnessing the knowledge of large-scale pre-trained models with limited data and computational resources. In this paper, we propose a novel few-shot feature distillation approach for vision transformers. Our approach is based on two key steps. Leveraging the fact that vision transformers have a consistent depth-wise structure, we first copy the weights from intermittent layers of existing pre-trained vision transformers (teachers) into shallower architectures (students), where the intermittence factor controls the complexity of the student transformer with respect to its teacher. Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario, aiming to recover the information processing carried out by the skipped teacher layers. We present comprehensive experiments with supervised and self-supervised transformers as teachers, on six data sets from various domains (natural, medical and satellite images) and tasks (classification and segmentation). The empirical results confirm the superiority of our approach over state-of-the-art competitors. Moreover, the ablation results demonstrate the usefulness of each component of the proposed pipeline. We release our code at https://github.com/dianagrigore/WeCoLoRA.
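The two steps above can be illustrated with a minimal sketch. The function and class names below (`copied_layer_indices`, `LoRALinear`) are hypothetical, not the authors' released code: the first step selects every k-th teacher layer for the student (k being the intermittence factor), and the second wraps a frozen weight with a trainable low-rank update in the standard LoRA form.

```python
import numpy as np

def copied_layer_indices(teacher_depth: int, k: int) -> list:
    """Select every k-th teacher layer to initialize a shallower student.
    The intermittence factor k controls the student/teacher depth ratio.
    Hypothetical helper, not the authors' exact implementation."""
    return list(range(0, teacher_depth, k))

class LoRALinear:
    """Frozen pre-trained weight W plus a trainable low-rank update
    (alpha / r) * B @ A, as in standard LoRA. During distillation only
    A and B would be trained to mimic the teacher's features."""
    def __init__(self, W: np.ndarray, r: int = 8, alpha: float = 16.0):
        d_out, d_in = W.shape
        self.W = W                                 # frozen, copied from teacher
        self.A = np.random.randn(r, d_in) * 0.01   # trainable down-projection
        self.B = np.zeros((d_out, r))              # trainable, zero-initialized
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Effective weight = frozen W + scaled low-rank correction.
        return x @ (self.W + self.scale * self.B @ self.A).T

# A 12-layer teacher with intermittence factor k = 2 yields a 6-layer student.
print(copied_layer_indices(12, 2))  # [0, 2, 4, 6, 8, 10]
```

Because B is zero-initialized, the student starts out computing exactly what its copied teacher layers compute; the LoRA updates then learn to compensate for the skipped layers.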