Teacher-Student Architecture for Knowledge Distillation: A Survey

Although deep neural networks (DNNs) have shown a strong capacity to solve large-scale problems in many areas, they are hard to deploy in real-world systems because of their large parameter counts. To tackle this issue, Teacher-Student architectures were proposed, in which simple student networks with few parameters can achieve performance comparable to that of deep teacher networks with many parameters. Recently, Teacher-Student architectures have been widely and effectively adopted for various knowledge distillation (KD) objectives, including knowledge compression, knowledge expansion, knowledge adaptation, and knowledge enhancement. With the help of Teacher-Student architectures, current studies can achieve multiple distillation objectives through lightweight and generalized student networks. Unlike existing KD surveys, which primarily focus on knowledge compression, this survey is the first to explore Teacher-Student architectures across multiple distillation objectives. It introduces various knowledge representations and their corresponding optimization objectives. Additionally, we provide a systematic overview of Teacher-Student architectures with representative learning algorithms and effective distillation schemes. This survey also summarizes recent applications of Teacher-Student architectures across a range of tasks, including classification, recognition, generation, ranking, and regression. Lastly, we investigate potential research directions in KD, focusing on architecture design, knowledge quality, and theoretical studies of regression-based learning. Through this comprehensive survey, industry practitioners and the academic community can gain valuable insights and guidelines for effectively designing, learning, and applying Teacher-Student architectures on various distillation objectives.
