Knowledge distillation (KD) is widely recognized as an effective tool for compressing and accelerating models. However, current KD approaches generally suffer from an accuracy drop, an excruciatingly long distillation process, or both. In this paper, we tackle these issues by first providing new insight into a phenomenon that we call Inter-Block Optimization Entanglement (IBOE), which makes conventional end-to-end KD approaches unstable because of noisy gradients. We then propose StableKD, a novel KD framework that breaks the IBOE and achieves more stable optimization. StableKD is distinguished by two operations, Decomposition and Recomposition: the former divides a pair of teacher and student networks into several blocks for separate distillation, and the latter progressively merges them back, evolving towards end-to-end distillation. We conduct extensive experiments on the CIFAR100, Imagewoof, and ImageNet datasets with various teacher-student pairs. Compared to other KD approaches, our simple yet effective StableKD boosts model accuracy by 1% to 18%, speeds up convergence by up to 10 times, and outperforms them with only 40% of the training data.
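The interplay of Decomposition and Recomposition can be illustrated with a small scheduling sketch. The abstract does not specify how or when blocks are merged, so the pairwise-merge rule below and the names `decompose`, `recompose`, and `schedule` are assumptions made for illustration only, not the authors' procedure. Each stage lists groups of consecutive block indices; each group would be distilled as one unit against the matching teacher blocks.

```python
from typing import List, Tuple

# A group is a run of consecutive block indices distilled as one unit.
Group = Tuple[int, ...]


def decompose(num_blocks: int) -> List[Group]:
    """Decomposition: start with every block as its own distillation unit."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    return [(i,) for i in range(num_blocks)]


def recompose(groups: List[Group]) -> List[Group]:
    """One Recomposition step.

    Assumption: adjacent groups are merged pairwise. The source only says
    blocks are merged back progressively, not by which rule.
    """
    merged = []
    for i in range(0, len(groups), 2):
        pair = groups[i:i + 2]  # the last pair may hold a single group
        merged.append(tuple(idx for g in pair for idx in g))
    return merged


def schedule(num_blocks: int) -> List[List[Group]]:
    """List the stages from fully decomposed to a single end-to-end group."""
    stages = [decompose(num_blocks)]
    while len(stages[-1]) > 1:
        stages.append(recompose(stages[-1]))
    return stages


if __name__ == "__main__":
    for stage_index, groups in enumerate(schedule(4)):
        print(stage_index, groups)
```

With four blocks, this hypothetical schedule has three stages: four separate blocks, then two merged pairs, then a single group that corresponds to end-to-end distillation. How long each stage is trained, and which loss is applied per group, are left open here because the abstract does not state them.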