Knowledge distillation is an effective paradigm for boosting the performance of pocket-size models, and when multiple teacher models are available the student can be pushed past its previous upper limit. However, it is not economical to train diverse teacher models for a one-off distillation. In this paper, we introduce a new concept for distillation dubbed Avatars, which are inference ensemble models derived from the teacher. Concretely, (1) at each iteration of distillation training, various Avatars are generated by a perturbation transformation. We validate that Avatars have a higher upper limit of working capacity and teaching ability, helping the student model learn diverse and receptive knowledge perspectives from the teacher model. (2) During distillation, we propose an uncertainty-aware factor, computed from the variance of the statistical differences between the vanilla teacher and the Avatars, to adaptively adjust the Avatars' contribution to knowledge transfer. Avatar Knowledge Distillation (AKD) is fundamentally different from existing methods and refines them from the innovative perspective of unequal training. Comprehensive experiments demonstrate the effectiveness of our Avatars mechanism, which improves state-of-the-art distillation methods for dense prediction without extra computational cost. AKD brings gains of up to 0.7 AP on COCO 2017 for object detection and 1.83 mIoU on Cityscapes for semantic segmentation.
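The two steps named in the abstract, generating Avatars by perturbation and weighting them with an uncertainty-aware factor, can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not specify the perturbation or the exact form of the factor, so the choices below are assumptions made purely for illustration: dropout-style masking as the perturbation, a factor of 1 / (1 + variance of the element-wise Avatar-teacher difference) normalised across Avatars, and a mean-squared-error feature loss. All function names are hypothetical.

```python
import random
from statistics import pvariance


def make_avatars(teacher_feat, num_avatars, drop_prob, rng):
    """Return perturbed copies ("Avatars") of a teacher feature vector.

    ASSUMPTION: the perturbation is dropout-style masking with inverse
    scaling; the abstract only says "a perturbation transformation".
    """
    keep = 1.0 - drop_prob
    return [
        [x / keep if rng.random() < keep else 0.0 for x in teacher_feat]
        for _ in range(num_avatars)
    ]


def uncertainty_factors(teacher_feat, avatars):
    """Return one weight per Avatar; the weights sum to 1.

    ASSUMPTION: an Avatar whose element-wise difference from the teacher
    has higher variance is treated as more uncertain and weighted less.
    """
    raw = []
    for avatar in avatars:
        diffs = [a - t for a, t in zip(avatar, teacher_feat)]
        raw.append(1.0 / (1.0 + pvariance(diffs)))
    total = sum(raw)
    return [r / total for r in raw]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def akd_style_loss(student_feat, teacher_feat, avatars):
    """Teacher loss plus the factor-weighted sum of per-Avatar losses."""
    factors = uncertainty_factors(teacher_feat, avatars)
    teacher_term = mse(student_feat, teacher_feat)
    avatar_term = sum(w * mse(student_feat, av) for w, av in zip(factors, avatars))
    return teacher_term + avatar_term


# Toy usage: fresh Avatars would be drawn at every training iteration.
rng = random.Random(0)
teacher = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
student = [0.4, -0.8, 1.7, 0.1, 1.2, -0.3]
avatars = make_avatars(teacher, num_avatars=3, drop_prob=0.2, rng=rng)
loss = akd_style_loss(student, teacher, avatars)
```

In this sketch the Avatars reuse the teacher's existing features rather than requiring additional trained teachers, which is in the spirit of the abstract's claim that no extra teacher training is needed. A real dense-prediction setup would operate on feature maps inside a deep learning framework rather than on plain lists.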