With the rapid development of computer vision, Vision Transformers (ViTs) offer the tantalizing prospect of unified information processing across visual and textual domains. However, because ViTs lack inherent inductive biases, they require an enormous amount of training data. To make their application practical, we introduce an ensemble-based distillation approach that distills inductive bias from complementary lightweight teacher models. Prior systems relied solely on convolution-based teaching. In contrast, our method uses an ensemble of light teachers with different architectural tendencies, such as convolution and involution, to instruct the student transformer jointly. Because each teacher carries a distinct inductive bias, the ensemble can accumulate a wide range of knowledge, even from readily identifiable stored datasets, which leads to enhanced student performance. Our framework also precomputes and stores the teachers' logits, that is, the unnormalized predictions of each model. This optimization accelerates distillation by eliminating repeated teacher forward passes during training, significantly reducing the computational burden.
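The two ideas above, joint instruction by several teachers and cached teacher logits, can be illustrated with a minimal sketch. It is not the implementation used in this work. The function names, the toy teachers `conv_teacher` and `inv_teacher`, the temperature value, and the choice to average the per-teacher KL terms with equal weight are all assumptions made for illustration. A full training objective would typically also include a supervised loss on the ground-truth labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def precompute_teacher_logits(teachers, dataset):
    """Run every teacher once over the dataset and cache its logits."""
    return {name: [teacher(x) for x in dataset] for name, teacher in teachers.items()}


def ensemble_distillation_loss(student_logits, cached_logits, index, temperature=2.0):
    """Average KL between each teacher's softened output and the student's.

    Equal weighting of teachers is an assumption of this sketch.
    """
    student_probs = softmax(student_logits, temperature)
    losses = [
        kl_divergence(softmax(logits[index], temperature), student_probs)
        for logits in cached_logits.values()
    ]
    return sum(losses) / len(losses)


# Hypothetical stand-ins for a convolution-based and an involution-based teacher.
def conv_teacher(x):
    return [x, 0.0, -x]


def inv_teacher(x):
    return [0.5 * x, 0.5, -0.5 * x]


teachers = {"conv_teacher": conv_teacher, "inv_teacher": inv_teacher}
dataset = [1.0, 2.0]  # toy "samples"; real inputs would be images

# Teachers run exactly once here; training only reads from the cache.
cache = precompute_teacher_logits(teachers, dataset)
loss = ensemble_distillation_loss([1.0, 0.0, -1.0], cache, index=0)
```

Because the cache is indexed by sample, this form assumes each sample is presented to the student in the same way the teachers saw it. Random augmentation applied during distillation would need the logits stored per augmented view.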