Knowledge Distillation (KD) transfers dark knowledge from a large teacher to a small student network, so that the student can be far more efficient than the teacher while achieving comparable accuracy. Existing KD methods, however, rely on a large teacher trained specifically for the target task, which is both inflexible and inefficient. In this paper, we argue that an SSL-pretrained model can effectively act as the teacher, and that its dark knowledge can be captured by the coordinate system, or linear subspace, in which its features lie. We then need only a single forward pass of the teacher, after which we tailor the coordinate system (TCS) for the student network. Our TCS method is teacher-free, applies to diverse architectures, works well for both KD and practical few-shot learning, and supports cross-architecture distillation with a large capacity gap. Experiments show that TCS achieves significantly higher accuracy than state-of-the-art KD methods while requiring only roughly half of their training time and GPU memory.
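The abstract does not specify how the coordinate system is constructed; one plausible instantiation, sketched below under that assumption, is to take the top principal directions of the teacher's features (via SVD/PCA) as the coordinate system, and to train the student so that a linear head maps its features onto the teacher's coordinates. The dimensions, the `fit_subspace` helper, and the squared-error objective are all illustrative, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# One forward pass of a frozen SSL-pretrained teacher yields a feature bank
# (N samples x D_t dims); random data stands in for real features here.
teacher_feats = rng.standard_normal((1000, 128))

def fit_subspace(feats, k):
    """Top-k principal directions of the (centered) teacher features,
    serving as the coordinate system in which dark knowledge is captured."""
    centered = feats - feats.mean(axis=0)
    # Rows of vt are orthonormal principal directions, largest variance first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]  # (k, D_t) basis

basis = fit_subspace(teacher_feats, k=32)

# Distillation targets: teacher features expressed in that coordinate system.
targets = (teacher_feats - teacher_feats.mean(axis=0)) @ basis.T  # (N, k)

# A student of a different width (hypothetical D_s = 64) is trained so a
# linear head projects its features onto the same coordinates; shown here
# as the loss for one batch, with the head as a plain trainable matrix.
student_feats = rng.standard_normal((1000, 64))
head = 0.01 * rng.standard_normal((64, 32))
loss = np.mean((student_feats @ head - targets) ** 2)
```

Because the teacher is only used once to produce `teacher_feats` and `basis`, no teacher network needs to be kept in memory during student training, which is consistent with the halved training time and GPU memory reported in the abstract.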