Few-shot knowledge distillation has recently emerged as a viable approach to harness the knowledge of large-scale pre-trained models using limited data and computational resources. In this paper, we propose a novel few-shot feature distillation approach for vision transformers. Our approach is based on two key steps. First, leveraging the fact that vision transformers have a consistent depth-wise structure, we copy the weights from intermittent layers of existing pre-trained vision transformers (teachers) into shallower architectures (students), where the intermittence factor controls the complexity of the student transformer with respect to its teacher. Second, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario, aiming to recover the information processing carried out by the skipped teacher layers. We present comprehensive experiments with supervised and self-supervised transformers as teachers, on five data sets from various domains, including natural, medical and satellite images. The empirical results confirm the superiority of our approach over competitive baselines. Moreover, the ablation results demonstrate the usefulness of each component of the proposed pipeline.
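The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the choice to keep every r-th teacher block starting from the first is an assumption about how the intermittence factor is applied, and the low-rank update shown is the standard LoRA form, W + (alpha / rank) * B A, rather than the enhanced variant the paper proposes. Weights are plain nested lists so that the example has no dependencies.

```python
import copy


def select_teacher_layers(num_teacher_layers, intermittence):
    """Indices of the teacher blocks kept by the student.

    Assumption: every `intermittence`-th block is kept, starting at block 0.
    """
    if intermittence < 1:
        raise ValueError("intermittence must be a positive integer")
    return list(range(0, num_teacher_layers, intermittence))


def copy_intermittent_weights(teacher_blocks, intermittence):
    """Build the student as deep copies of the selected teacher blocks."""
    kept = select_teacher_layers(len(teacher_blocks), intermittence)
    return [copy.deepcopy(teacher_blocks[i]) for i in kept]


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Standard LoRA: frozen weight plus a scaled low-rank update.

    w: d_out x d_in (frozen, copied from the teacher)
    a: rank x d_in, b: d_out x rank (the only trainable parameters)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# A 12-block teacher with intermittence factor 3 gives a 4-block student.
teacher = [{"w": [[float(i)]]} for i in range(12)]
student = copy_intermittent_weights(teacher, 3)
print(len(student))
```

In standard LoRA, B is initialised to zero, so the student starts out with exactly the copied teacher weights and only the low-rank factors are updated during few-shot distillation.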