In theoretical machine learning, the teacher-student paradigm is often employed as an effective metaphor for real-life tuition. The scheme is particularly relevant when the student network is overparameterized compared to the teacher network. Under these conditions, it is tempting to speculate that the student's ability to handle the given task could eventually be stored in a sub-portion of the whole network. This subnetwork should, according to suitable metrics, be reminiscent to some extent of the frozen teacher structure, while being approximately invariant across different architectures of the candidate student network. Unfortunately, state-of-the-art conventional learning techniques cannot help in identifying such an invariant subnetwork, owing to the inherent non-convexity of the problem. In this work, we take a leap forward by proposing a radically different optimization scheme that builds on a spectral representation of the linear transfer of information between layers. The gradient is hence calculated with respect to both eigenvalues and eigenvectors, with a negligible increase in computational load and complexity compared to standard training algorithms. Working in this framework, we can isolate a stable student substructure that mirrors the true complexity of the teacher in terms of computing neurons, path distribution and topological attributes. When unimportant nodes of the trained student are pruned, following a ranking that reflects the optimized eigenvalues, no degradation in the recorded performance is seen as long as the number of retained nodes stays above a threshold that corresponds to the effective teacher size. The observed behavior can be pictured as a genuine second-order phase transition that bears universality traits.
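The abstract does not spell out the spectral parametrization, so the following is only a minimal sketch under stated assumptions. It assumes a form in which the weight from sending node j to receiving node i is written as w_ij = (lambda_in_j - lambda_out_i) * phi_ij, where the lambdas play the role of eigenvalues and phi collects the non-trivial eigenvector entries, and it assumes that nodes are ranked for pruning by the absolute value of their eigenvalue. The names `spectral_weights`, `rank_input_nodes` and `prune_input_nodes` are hypothetical and are not taken from the work itself.

```python
import random


def spectral_weights(phi, lam_in, lam_out):
    """Build a weight matrix from assumed spectral parameters.

    Assumed form: w[i][j] = (lam_in[j] - lam_out[i]) * phi[i][j],
    with i indexing receiving nodes and j indexing sending nodes.
    """
    return [
        [(lam_in[j] - lam_out[i]) * phi[i][j] for j in range(len(lam_in))]
        for i in range(len(lam_out))
    ]


def rank_input_nodes(lam_in):
    """Return sending-node indices, most important first.

    Importance is assumed to be the absolute value of the node's eigenvalue.
    """
    return sorted(range(len(lam_in)), key=lambda j: abs(lam_in[j]), reverse=True)


def prune_input_nodes(weights, lam_in, keep):
    """Keep only the `keep` highest-ranked sending nodes (columns of weights)."""
    kept = sorted(rank_input_nodes(lam_in)[:keep])
    pruned = [[row[j] for j in kept] for row in weights]
    return pruned, kept


# Toy example with made-up numbers: 5 sending nodes, 3 receiving nodes.
random.seed(0)
n_in, n_out = 5, 3
phi = [[random.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
lam_in = [0.9, 0.01, -1.2, 0.02, 0.5]   # two eigenvalues are close to zero
lam_out = [0.0, 0.0, 0.0]

w = spectral_weights(phi, lam_in, lam_out)
w_pruned, kept = prune_input_nodes(w, lam_in, keep=3)
print("kept sending nodes:", kept)
```

In this toy setting, the two nodes with near-zero eigenvalues contribute almost nothing to the weights, which is why removing them first is a natural choice under the assumed form. In an actual training run both the eigenvalues and phi would be optimized by gradient descent, a step this sketch leaves out.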