Knowledge distillation is an effective paradigm for boosting the performance of compact models, and when multiple teacher models are available, the student can push past its previous performance ceiling. However, training diverse teacher models for a one-off distillation run is not economical. In this paper, we introduce a new concept for distillation dubbed Avatars, which are inference ensemble models derived from the teacher. Concretely: (1) In each iteration of distillation training, various Avatars are generated by a perturbation transformation. We validate that Avatars have a higher upper limit of working capacity and teaching ability, helping the student model learn diverse and receptive knowledge perspectives from the teacher model. (2) During distillation, we propose an uncertainty-aware factor, computed from the variance of the statistical differences between the vanilla teacher and the Avatars, to adaptively adjust the Avatars' contribution to knowledge transfer. Avatar Knowledge Distillation (AKD) is fundamentally different from existing methods and refines distillation from the novel perspective of unequal training. Comprehensive experiments demonstrate the effectiveness of our Avatars mechanism, which improves state-of-the-art distillation methods for dense prediction at no additional computational cost. AKD brings gains of up to 0.7 AP on COCO 2017 for object detection and 1.83 mIoU on Cityscapes for semantic segmentation.
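The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the perturbation transformation or the exact form of the uncertainty-aware factor, so everything concrete here is an assumption made for illustration: the perturbation is taken to be inverted dropout on a teacher feature vector, the factor is taken to be exp(-variance) of the element-wise difference between Avatar and teacher, and all function names (`make_avatar`, `uncertainty_weight`, `akd_style_loss`) are hypothetical rather than taken from the paper.

```python
import math
import random
from statistics import pvariance


def make_avatar(teacher_feat, drop_prob, rng):
    """Hypothetical perturbation: inverted dropout on a teacher feature vector.

    ASSUMPTION: the abstract only says "a perturbation transformation";
    dropout is used here purely as a stand-in.
    """
    keep = 1.0 - drop_prob
    return [x / keep if rng.random() < keep else 0.0 for x in teacher_feat]


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def uncertainty_weight(teacher_feat, avatar_feat):
    """ASSUMED form of the uncertainty-aware factor.

    The weight shrinks toward 0 as the variance of (avatar - teacher) grows,
    so Avatars that deviate more erratically from the teacher contribute less.
    """
    diffs = [a - t for a, t in zip(avatar_feat, teacher_feat)]
    return math.exp(-pvariance(diffs))


def akd_style_loss(student_feat, teacher_feat, num_avatars=3, drop_prob=0.1, seed=0):
    """Teacher loss plus uncertainty-weighted losses against fresh Avatars."""
    rng = random.Random(seed)
    loss = mse(student_feat, teacher_feat)
    for _ in range(num_avatars):
        # New Avatars are drawn on every call, mirroring "each iteration".
        avatar = make_avatar(teacher_feat, drop_prob, rng)
        weight = uncertainty_weight(teacher_feat, avatar)
        loss += weight * mse(student_feat, avatar)
    return loss
```

In this sketch the Avatars are derived from a single teacher output, so no extra teacher models need to be trained; how closely that matches the paper's actual cost profile and weighting scheme is not something the abstract confirms.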