Should Under-parameterized Student Networks Copy or Average Teacher Weights?

Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a neural network with $n < k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with $k$ neurons. As the student has fewer neurons than the teacher, it is unclear whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when $n-1$ student neurons each copy one teacher neuron and the $n$-th student neuron averages the remaining $k-n+1$ teacher neurons. For the student network with $n=1$ neuron, we additionally provide a closed-form solution of the non-trivial critical point(s) for commonly used activation functions by solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of under-parameterized networks has a universal structure.
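The copy-versus-average question can be made concrete with a small numerical sketch. The code below is an illustration, not the paper's procedure: it evaluates the population squared loss between two one-hidden-layer erf networks under standard Gaussian input, using the standard arcsin identity for $\mathbb{E}[\mathrm{erf}(w\cdot x)\,\mathrm{erf}(v\cdot x)]$, and compares a "copy-average" student with a "copy-copy" student for $k=4$, $n=2$. The function names, the choice of outgoing weights equal to 1 for the teacher, the unit norm of the averaging neuron, and the least-squares choice of its outgoing weight are all assumptions made for this example; they are not the critical-point values derived in the paper.

```python
import math

def erf_overlap(w, v):
    """E[erf(w.x) erf(v.x)] for x ~ N(0, I), via the arcsin closed form."""
    dot = sum(a * b for a, b in zip(w, v))
    nw = sum(a * a for a in w)
    nv = sum(b * b for b in v)
    return (2.0 / math.pi) * math.asin(
        2.0 * dot / math.sqrt((1.0 + 2.0 * nw) * (1.0 + 2.0 * nv)))

def population_loss(student_w, student_a, teacher_w, teacher_a):
    """E[(student(x) - teacher(x))^2] for one-hidden-layer erf networks."""
    # Treat the difference as one network: student weights +a, teacher weights -b.
    ws = list(student_w) + list(teacher_w)
    cs = list(student_a) + [-b for b in teacher_a]
    return sum(ci * cj * erf_overlap(wi, wj)
               for ci, wi in zip(cs, ws)
               for cj, wj in zip(cs, ws))

def unit(i, d):
    """i-th standard basis vector in d dimensions."""
    return [1.0 if j == i else 0.0 for j in range(d)]

k, n = 4, 2
teacher_w = [unit(i, k) for i in range(k)]  # orthonormal incoming vectors
teacher_a = [1.0] * k                       # assumed outgoing weights of 1

# Copy-average layout: neuron 0 copies teacher 0, neuron 1 points along the
# mean direction of the remaining k - n + 1 teachers.
rest = k - n + 1
avg_dir = [0.0] + [1.0 / math.sqrt(rest)] * rest  # unit norm: an assumption
# Outgoing weight of the averaging neuron chosen by least squares for this
# fixed incoming vector (illustrative choice, not the paper's value).
avg_a = (sum(erf_overlap(avg_dir, v) for v in teacher_w[1:])
         / erf_overlap(avg_dir, avg_dir))

loss_copy_average = population_loss(
    [teacher_w[0], avg_dir], [1.0, avg_a], teacher_w, teacher_a)

# Copy-copy layout: each student neuron copies a different teacher neuron.
loss_copy_copy = population_loss(
    teacher_w[:n], teacher_a[:n], teacher_w, teacher_a)

print("copy-average loss:", loss_copy_average)
print("copy-copy loss:   ", loss_copy_copy)
```

In this toy setting the copy-average student attains a lower loss than the copy-copy student, which is in line with the abstract's claim about the optimum. The sketch does not verify that either configuration is a critical point, since the incoming-vector norm is fixed by hand rather than optimized.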

