Knowledge distillation is an effective paradigm for boosting the performance of compact models, and when multiple teacher models are available, the student can push its performance ceiling even higher. However, training diverse teacher models for a one-off distillation is not economical. In this paper, we introduce a new concept for distillation dubbed Avatars, which are inference ensemble models derived from the teacher. Concretely, (1) in each iteration of distillation training, various Avatars are generated by a perturbation transformation. We validate that Avatars have a higher upper limit of working capacity and teaching ability, helping the student model learn diverse and receptive knowledge perspectives from the teacher model. (2) During distillation, we propose an uncertainty-aware factor, computed from the variance of the statistical differences between the vanilla teacher and the Avatars, to adaptively adjust the Avatars' contribution to knowledge transfer. Avatar Knowledge Distillation (AKD) is fundamentally different from existing methods and refines them from the innovative perspective of unequal training. Comprehensive experiments demonstrate the effectiveness of our Avatars mechanism, which improves state-of-the-art distillation methods for dense prediction without extra computational cost. AKD brings gains of up to 0.7 AP on COCO 2017 for object detection and 1.83 mIoU on Cityscapes for semantic segmentation. Code is available at https://github.com/Gumpest/AvatarKD.
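The two steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the perturbation transformation or the exact form of the uncertainty-aware factor, so the sketch assumes an inverted-dropout perturbation on a teacher feature map and a weight of 1 / (1 + variance) computed from the difference between each Avatar and the vanilla teacher. The names `make_avatars`, `uncertainty_weights`, and `akd_style_loss` are hypothetical.

```python
import numpy as np


def make_avatars(teacher_feat, num_avatars, drop_prob, rng):
    """Generate perturbed copies of a teacher feature map.

    Assumption: the perturbation is inverted dropout; the abstract only
    says "a perturbation transformation".
    """
    avatars = []
    for _ in range(num_avatars):
        keep = rng.random(teacher_feat.shape) >= drop_prob
        avatars.append(teacher_feat * keep / (1.0 - drop_prob))
    return avatars


def uncertainty_weights(teacher_feat, avatars):
    """Map each Avatar's deviation from the teacher to a weight in (0, 1].

    Assumption: the weight is 1 / (1 + var(avatar - teacher)), so an Avatar
    that deviates more from the vanilla teacher contributes less.
    """
    variances = np.array([np.var(a - teacher_feat) for a in avatars])
    return 1.0 / (1.0 + variances)


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))


def akd_style_loss(student_feat, teacher_feat, avatars, weights):
    """Teacher term plus uncertainty-weighted Avatar terms (illustrative)."""
    loss = mse(student_feat, teacher_feat)
    for w, a in zip(weights, avatars):
        loss += float(w) * mse(student_feat, a)
    return loss


# Toy feature maps of shape (channels, height, width).
rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 4, 4))
student = teacher + 0.1 * rng.standard_normal((8, 4, 4))

# Fresh Avatars would be drawn in every training iteration.
avatars = make_avatars(teacher, num_avatars=3, drop_prob=0.1, rng=rng)
weights = uncertainty_weights(teacher, avatars)
loss = akd_style_loss(student, teacher, avatars, weights)
print("weights:", weights, "loss:", loss)
```

In this sketch the Avatars are produced from the teacher's existing features at training time, so no additional teacher has to be trained, which mirrors the motivation stated in the abstract.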