Multi-teacher knowledge distillation as an effective method for compressing ensembles of neural networks

From arXiv. Doctoral dissertation in computer science (machine learning). Applies knowledge distillation as a method for aggregating ensemble models, along with several applications. 140 pages, 67 figures, 13 tables.

Deep learning has driven many of the recent successes in artificial intelligence. It is now possible to train models with thousands of layers and hundreds of billions of parameters. Such large-scale deep models perform very well, but their computational cost and storage requirements make them extremely difficult to deploy in real-time applications. At the same time, dataset size remains a real problem in many domains: data are often missing, too expensive, or impossible to obtain for other reasons. Ensemble learning partially addresses the problems of small datasets and overfitting. In its basic form, however, the computational cost of an ensemble grows linearly with the number of models. We analyzed the impact of the ensemble decision-fusion mechanism and evaluated several methods of combining decisions, including voting algorithms. We used a modified knowledge distillation framework as the decision-fusion mechanism, which additionally compresses the entire ensemble into the weight space of a single model. We showed that knowledge distillation can aggregate knowledge from multiple teachers into a single student model and, at the same computational complexity, yield a better-performing model than one trained in the standard manner. We developed our own method for mimicking the responses of all teachers simultaneously. We tested these solutions on several benchmark datasets. Finally, we presented a wide range of applications of the efficient multi-teacher knowledge distillation framework. In the first example, we used knowledge distillation to develop models that automate corrosion detection on aircraft fuselages. The second example describes smoke detection in observation-camera footage to help counteract forest wildfires.
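The abstract does not spell out the loss used to mimic all teachers at once, so the following is only a minimal sketch of one common multi-teacher formulation, not the dissertation's own method. It assumes the student is trained on a weighted sum of the hard-label cross-entropy and the average, over teachers, of the KL divergence between each teacher's temperature-softened output and the student's. The function names, the `temperature` and `alpha` values, and the hard majority vote shown for contrast are all illustrative assumptions.

```python
import math
from collections import Counter


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def multi_teacher_kd_loss(student_logits, teacher_logits_list, label,
                          temperature=4.0, alpha=0.5):
    """Hypothetical multi-teacher distillation loss for one sample.

    alpha weights the hard-label term; (1 - alpha) weights the soft term.
    Both hyperparameter values are assumptions, not taken from the source.
    """
    # Hard-label cross-entropy at temperature 1.
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[label] + 1e-12)

    # Soft term: match every teacher separately, then average.
    student_soft = softmax(student_logits, temperature)
    kd = sum(
        kl_divergence(softmax(t, temperature), student_soft)
        for t in teacher_logits_list
    ) / len(teacher_logits_list)

    # The T^2 factor keeps the soft-term gradient scale comparable across T.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kd


def majority_vote(teacher_logits_list):
    """Baseline fusion: each teacher votes for its arg-max class."""
    votes = [max(range(len(t)), key=t.__getitem__) for t in teacher_logits_list]
    return Counter(votes).most_common(1)[0][0]
```

Unlike voting, which needs every teacher at inference time, a loss of this kind is used only during training, so the deployed student costs the same as a single model.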
