Knowledge transfer has been shown to be a very successful technique for training neural classifiers: together with the ground truth data, it uses the "privileged information" (PI) obtained by a "teacher" network to train a "student" network. It has been observed that classifiers learn much faster and more reliably via knowledge transfer. However, there has been little or no theoretical analysis of this phenomenon. To bridge this gap, we propose to approach the problem of knowledge transfer by regularizing the fit between the teacher and the student with PI provided by the teacher. Using tools from dynamical systems theory, we show that when the student is an extremely wide two-layer network, we can analyze it in the kernel regime and show that it is able to interpolate between PI and the given data. This characterization sheds new light on the relation between the training error and the capacity of the student relative to the teacher. Another contribution of the paper is a quantitative statement on the convergence of the student network. We prove that the teacher reduces the number of iterations required for a student to learn, and consequently improves the generalization power of the student. We also provide an experimental analysis that validates the theoretical results and yields additional insights.
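As a rough illustration of the interpolation claim, the sketch below uses a random-feature model (fixed random first layer, trained second layer) as a stand-in for a very wide two-layer student in the kernel regime. Everything in it is an assumption made for illustration and is not taken from the paper: the convex weighting `lam` between the data-fit term and the teacher-fit term, the synthetic data, the soft teacher scores `teacher_out`, and the helper names `random_features` and `fit_student`. It solves for the minimum-norm least-squares solution directly, which is the point gradient descent from zero initialization converges to for this linear-in-parameters model, and checks that the student's outputs on the training set equal a blend of the labels and the teacher's outputs.

```python
import numpy as np

def random_features(X, W):
    """Fixed random first layer with tanh activation, scaled by 1/sqrt(width)."""
    return np.tanh(X @ W) / np.sqrt(W.shape[1])

def fit_student(Phi, y, teacher_out, lam):
    """Train only the second layer on the assumed regularized objective
        (1 - lam) * ||Phi a - y||^2 + lam * ||Phi a - teacher_out||^2.
    Completing the square gives ||Phi a - blend||^2 + const, where
    blend = (1 - lam) * y + lam * teacher_out, so a single least-squares
    solve suffices. Returns the student's outputs on the training inputs.
    """
    blend = (1 - lam) * y + lam * teacher_out
    # lstsq returns the minimum-norm solution when Phi has more columns than rows.
    a, *_ = np.linalg.lstsq(Phi, blend, rcond=None)
    return Phi @ a

# Synthetic setup (hypothetical): 10 samples, 5 input dimensions, width 500.
rng = np.random.default_rng(0)
n, d, m = 10, 5, 500
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])                  # hard ground-truth labels in {-1, +1}
teacher_out = np.tanh(2.0 * X[:, 0])  # soft scores standing in for the teacher's PI

Phi = random_features(X, rng.normal(size=(d, m)))

lam = 0.3
student_out = fit_student(Phi, y, teacher_out, lam)
blend = (1 - lam) * y + lam * teacher_out
print("max deviation from blended target:", np.max(np.abs(student_out - blend)))
```

With `lam = 0` the student fits the labels alone, and with `lam = 1` it reproduces the teacher; intermediate values trade one off against the other. The sketch says nothing about convergence speed or generalization, which the paper treats separately.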