Knowledge distillation (KD) is a model compression method that trains a compact student model to emulate the performance of a more complex teacher model. However, the architectural capacity gap between the two models limits the effectiveness of knowledge transfer. Previous works have addressed this issue by customizing teacher-student pairs to improve compatibility, a computationally expensive process that must be repeated every time either model changes. These methods are therefore impractical when a single teacher model has to be compressed into different student models for deployment on multiple hardware devices with distinct resource constraints. In this work, we propose Generic Teacher Network (GTN), a one-off KD-aware training scheme that creates a generic teacher capable of effectively transferring knowledge to any student model sampled from a given finite pool of architectures. To this end, we represent the student pool as a weight-sharing supernet and condition our generic teacher to align with the capacities of the various student architectures sampled from this supernet. Experimental evaluation shows that our method both improves overall KD effectiveness and amortizes the minimal additional training cost of the generic teacher across the students in the pool.
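To make the underlying transfer mechanism concrete, the sketch below shows the standard soft-label distillation loss (temperature-scaled KL divergence between teacher and student output distributions) that KD methods, including teacher-aware ones like GTN, build upon. This is a generic illustration, not the paper's GTN objective; the function names and the temperature value are illustrative assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """Soft-label KD loss: T^2 * KL(p_teacher || p_student).

    The T^2 factor keeps gradient magnitudes comparable across
    temperatures; temperature=4.0 is an illustrative choice.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(temperature ** 2 * np.sum(p * (np.log(p) - np.log(q))))
```

A capacity-aligned teacher, as proposed in the abstract, would be trained so that this loss is low and informative for every student sampled from the supernet pool, rather than for one fixed student architecture.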