How a student becomes a teacher: learning and forgetting through Spectral methods

In theoretical ML, the teacher-student paradigm is often employed as an effective metaphor for real-life tuition. This scheme proves particularly relevant when the student network is overparameterized compared to the teacher network. Under these operating conditions, it is tempting to speculate that the student's ability to handle the given task could eventually be stored in a sub-portion of the whole network. This subnetwork should be, to some extent and according to suitable metrics, reminiscent of the frozen teacher structure, while remaining approximately invariant across different architectures of the candidate student network. Unfortunately, state-of-the-art conventional learning techniques have not been able to identify such an invariant subnetwork, owing to the inherent non-convexity of the problem. In this work, we take a leap forward by proposing a radically different optimization scheme that builds on a spectral representation of the linear transfer of information between layers. The gradient is hence calculated with respect to both eigenvalues and eigenvectors, with a negligible increase in computational load and complexity compared to standard training algorithms. Working in this framework, we could isolate a stable student substructure that mirrors the true complexity of the teacher in terms of computing neurons, path distribution and topological attributes. When unimportant nodes of the trained student are pruned, following a ranking that reflects the optimized eigenvalues, no degradation in the recorded performance is seen as long as the retained network stays above a threshold that corresponds to the effective teacher size. The observed behavior can be pictured as a genuine second-order phase transition that bears universality traits.
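The eigenvalue-ranked pruning described in the abstract can be illustrated with a small sketch. The abstract does not give the parametrization, so the code below assumes one: each weight is written as w[i, j] = (lam_in[j] - lam_out[i]) * phi[i, j], with lam_in and lam_out playing the role of eigenvalues attached to the sending and receiving nodes and phi the role of eigenvector components. The function names, the toy sizes and the choice lam_out = 0 are all hypothetical and serve only to show why nodes with small eigenvalue magnitude can be dropped; this is not the authors' implementation.

```python
import numpy as np

def spectral_weights(phi, lam_in, lam_out):
    """Assumed parametrization (not from the source):
    w[i, j] = (lam_in[j] - lam_out[i]) * phi[i, j]."""
    return (lam_in[np.newaxis, :] - lam_out[:, np.newaxis]) * phi

def rank_nodes(lam_in):
    """Order sending nodes by decreasing eigenvalue magnitude."""
    return np.argsort(-np.abs(lam_in), kind="stable")

def prune(phi, lam_in, keep):
    """Retain the `keep` sending nodes with the largest |lam_in|."""
    kept = np.sort(rank_nodes(lam_in)[:keep])
    return phi[:, kept], lam_in[kept], kept

rng = np.random.default_rng(0)
n_hidden, n_out = 6, 2

# Toy "trained" student layer: three hidden nodes carry a zero eigenvalue,
# mimicking an overparameterized student with effective size 3.
phi = rng.normal(size=(n_out, n_hidden))
lam_in = np.array([1.5, 0.0, -2.0, 0.0, 0.7, 0.0])
lam_out = np.zeros(n_out)  # simplifying assumption for this sketch

W = spectral_weights(phi, lam_in, lam_out)

# Prune down to the effective size and rebuild the smaller weight matrix.
phi_p, lam_p, kept = prune(phi, lam_in, keep=3)
W_p = spectral_weights(phi_p, lam_p, lam_out)

# Compare the layer output before and after pruning on one hidden activation.
h = rng.normal(size=n_hidden)
y_full = W @ h
y_pruned = W_p @ h[kept]
print(kept, np.allclose(y_full, y_pruned))
```

In this toy setting the pruned layer reproduces the full output because the removed nodes contribute nothing; pruning below three nodes would start to discard nodes with nonzero eigenvalues and change the output. In an actual training run both phi and the eigenvalues would be trainable, which the sketch does not attempt to show.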

