How a student becomes a teacher: learning and forgetting through Spectral methods

In theoretical ML, the teacher-student paradigm is often employed as an effective metaphor for real-life tuition. This scheme proves particularly relevant when the student network is overparameterized compared to the teacher network. Under these operating conditions, it is tempting to speculate that the student's ability to handle the given task could ultimately be stored in a sub-portion of the whole network. This subnetwork should be, to some extent, reminiscent of the frozen teacher structure according to suitable metrics, while being approximately invariant across different architectures of the candidate student network. Unfortunately, state-of-the-art conventional learning techniques have not helped in identifying such an invariant subnetwork, owing to the inherent non-convexity of the problem. In this work, we take a step forward by proposing a radically different optimization scheme which builds on a spectral representation of the linear transfer of information between layers. The gradient is hence calculated with respect to both eigenvalues and eigenvectors, with a negligible increase in computational load and complexity compared to standard training algorithms. Working in this framework, we isolate a stable student substructure that mirrors the true complexity of the teacher in terms of computing neurons, path distribution and topological attributes. When unimportant nodes of the trained student are pruned, following a ranking that reflects the optimized eigenvalues, no degradation in the recorded performance is seen as long as the number of retained nodes stays above a threshold that corresponds to the effective teacher size. The observed behavior can be pictured as a genuine second-order phase transition that bears universality traits.
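
The abstract does not spell out the spectral construction, so the following is only a minimal sketch under stated assumptions. It assumes that the linear transfer between two layers is the lower-left block of an operator A = Phi Lambda Phi^{-1}, where Phi is the identity plus a block phi coupling the sending nodes to the receiving nodes. Under that assumption each weight is w[i, j] = (lam_in[j] - lam_out[i]) * phi[i, j]. The function names, the layer sizes and the pruning rule (ranking sending nodes by |lam_in|) are hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def spectral_weights(phi, lam_in, lam_out):
    """Weights implied by the assumed spectral form:
    w[i, j] = (lam_in[j] - lam_out[i]) * phi[i, j]."""
    return (lam_in[None, :] - lam_out[:, None]) * phi

def full_operator(phi, lam_in, lam_out):
    """Build A = Phi @ Lambda @ Phi^{-1} on the joint (in + out) node space."""
    n_out, n_in = phi.shape
    n = n_in + n_out
    Phi = np.eye(n)
    Phi[n_in:, :n_in] = phi                  # eigenvector block (assumed structure)
    Lam = np.diag(np.concatenate([lam_in, lam_out]))
    Phi_inv = 2.0 * np.eye(n) - Phi          # exact inverse, since (Phi - I)^2 = 0
    return Phi @ Lam @ Phi_inv

def prune_by_eigenvalue(w, lam_in, keep):
    """Illustrative pruning rule: keep the `keep` sending nodes with the
    largest |lam_in| and zero the weight columns of all the others."""
    order = np.argsort(-np.abs(lam_in))
    mask = np.zeros(lam_in.shape, dtype=bool)
    mask[order[:keep]] = True
    return w * mask[None, :], mask

rng = np.random.default_rng(0)
n_in, n_out = 4, 3                           # toy layer sizes (hypothetical)
phi = rng.normal(size=(n_out, n_in))
lam_in = rng.normal(size=n_in)
lam_out = rng.normal(size=n_out)

w = spectral_weights(phi, lam_in, lam_out)
A = full_operator(phi, lam_in, lam_out)

# The lower-left block of A should coincide with the closed-form weights.
block_matches = np.allclose(A[n_in:, :n_in], w)

w_pruned, kept = prune_by_eigenvalue(w, lam_in, keep=2)
```

In this sketch the weights are bilinear in the eigenvalues and the eigenvector entries, so a gradient with respect to w propagates to lam_in, lam_out and phi through the chain rule, which is consistent with the negligible overhead claimed above.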

