Knowledge Distillation (KD) is essential for transferring dark knowledge from a large teacher to a small student network, so that the student can be far more efficient than the teacher while achieving comparable accuracy. Existing KD methods, however, rely on a large teacher trained specifically for the target task, which is both inflexible and inefficient. In this paper, we argue that a self-supervised learning (SSL) pretrained model can effectively act as the teacher, and that its dark knowledge can be captured by the coordinate system, or linear subspace, in which its features lie. We then need only one forward pass of the teacher, after which the coordinate system is tailored (TCS) to the student network. Our TCS method is teacher-free, applies to diverse architectures, works well for both KD and practical few-shot learning, and allows cross-architecture distillation with a large capacity gap. Experiments show that TCS achieves significantly higher accuracy than state-of-the-art KD methods, while requiring only roughly half of their training time and GPU memory.
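Below is a minimal sketch of the idea described in the abstract, under the assumption that the teacher's coordinate system is obtained via PCA of its features from a single forward pass, and that the student is trained to reproduce the teacher's features expressed in that coordinate system. All names (e.g. build_coordinate_system, distill_step) and the exact loss are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: summarize the SSL-pretrained teacher's dark knowledge as a
# linear subspace (coordinate system) of its features, computed once, then train
# the student to match the teacher's coordinates in that subspace.
import torch
import torch.nn as nn

@torch.no_grad()
def build_coordinate_system(teacher, loader, k, device="cpu"):
    """One forward pass of the (SSL-pretrained) teacher; PCA gives the top-k axes."""
    teacher.eval()
    feats = [teacher(x.to(device)) for x, _ in loader]
    F = torch.cat(feats)                      # (N, d_teacher)
    mean = F.mean(dim=0, keepdim=True)
    _, _, V = torch.pca_lowrank(F - mean, q=k)
    return mean, V                            # V: (d_teacher, k) coordinate axes

def distill_step(student, head, teacher, basis, batch, optimizer, device="cpu"):
    """Student (plus a small linear head) learns the teacher's tailored coordinates."""
    mean, V = basis
    x, _ = batch
    x = x.to(device)
    with torch.no_grad():
        target = (teacher(x) - mean) @ V      # teacher features in the coordinate system
    pred = head(student(x))                   # map student features to the same k dims
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the teacher is only queried once to build (and later project onto) the coordinate system, no task-specific teacher training is needed, which is consistent with the reduced training time and GPU memory reported above.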