Knowledge distillation provides an effective method for deploying complex machine learning models in resource-constrained environments. It typically involves training a smaller student model to emulate either the probabilistic outputs or the internal feature representations of a larger teacher model. By doing so, the student model often achieves substantially better performance on a downstream task than when it is trained independently. Nevertheless, the teacher's internal representations can also encode noise or additional information that is irrelevant to the downstream task. This observation motivates our primary question: What are the information-theoretic limits of knowledge transfer? To this end, we leverage a body of work in information theory called Partial Information Decomposition (PID) to quantify the distillable and distilled knowledge of a teacher's representation with respect to a given student and downstream task. Moreover, we demonstrate that this metric can be used in practice during distillation to address challenges caused by the complexity gap between the teacher and student representations.
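As a concrete illustration of the output-matching form of distillation described above, here is a minimal NumPy sketch of the standard softened-softmax distillation objective (KL divergence between temperature-scaled teacher and student distributions, in the style of Hinton et al.). This is an assumption-laden illustration of conventional distillation, not the PID-based metric proposed here; all function names and the temperature value are illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T yields softer distributions."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened outputs.

    The T**2 factor keeps gradient magnitudes comparable across
    temperatures, as in the classic soft-label distillation setup.
    """
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)  # student predictions
    return float(T**2 * np.sum(p * (np.log(p) - np.log(q))))
```

In a training loop, this term is typically combined with the ordinary cross-entropy on ground-truth labels; the loss is zero exactly when the student reproduces the teacher's softened distribution.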