Spatially heterogeneous learning by a deep student machine

Despite their spectacular successes, deep neural networks (DNNs) with a huge number of adjustable parameters remain largely black boxes. To shed light on the hidden layers of DNNs, we study supervised learning by a DNN of width $N$ and depth $L$, consisting of perceptrons with $c$ inputs, using a statistical mechanics approach called the teacher-student setting. We consider an ensemble of student machines that exactly reproduce $M$ sets of $N$-dimensional input/output relations provided by a teacher machine. We analyze the ensemble theoretically using a replica method (H. Yoshino (2020)) and numerically by performing greedy Monte Carlo simulations. The replica theory, which works on high-dimensional data $N \gg 1$, becomes exact in the 'dense limit' $N \gg c \gg 1$ and $M \gg 1$ with fixed $\alpha=M/c$. Both the theory and the simulations suggest that learning by the DNN is quite heterogeneous in the network space: configurations of the machines are more correlated within the layers closer to the input/output boundaries, while the central region remains much less correlated due to over-parametrization. Deep enough systems relax faster thanks to the less correlated central region. Remarkably, both the theory and the simulations suggest that the generalization ability of the student machines does not vanish even in the deep limit $L \gg 1$, where the system becomes strongly over-parametrized. We also consider the impact of the effective dimension $D(\leq N)$ of the data by incorporating the hidden manifold model (S. Goldt et al. (2020)) into our model. The replica theory implies that the loop corrections to the dense limit, which reflect correlations between different nodes in the network, are enhanced by decreasing either the width $N$ or the effective dimension $D$ of the data. Simulations suggest that both lead to significant improvements in generalization ability.
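As a rough illustration of the teacher-student setting described above, the following Python sketch builds a teacher of width $N$ and depth $L$ from sign perceptrons with $c$ inputs each, generates $M$ input/output pairs, and trains a student by a greedy (zero-temperature) Monte Carlo rule. It is a simplified sketch under stated assumptions, not the scheme used in the paper: the function names (`make_machine`, `forward`, `greedy_mc`), the shared random wiring between teacher and student, the Gaussian weights, the weight-only updates, and the use of the training error as the acceptance criterion are all choices made here for illustration.

```python
import numpy as np

def make_machine(N, L, c, rng):
    """Hypothetical helper: random wiring and Gaussian weights.

    idx[l, i] lists the c nodes of the previous layer read by perceptron i in layer l.
    """
    idx = np.array([[rng.choice(N, size=c, replace=False) for _ in range(N)]
                    for _ in range(L)])          # shape (L, N, c)
    J = rng.standard_normal((L, N, c))           # shape (L, N, c)
    return idx, J

def forward(idx, J, S0):
    """Propagate M patterns S0 (shape (M, N), entries +-1) through L layers of sign perceptrons."""
    S = S0
    for l in range(J.shape[0]):
        # S[:, idx[l]] gathers the c inputs of every perceptron: shape (M, N, c)
        h = np.einsum('mic,ic->mi', S[:, idx[l]], J[l])
        S = np.where(h >= 0, 1, -1)
    return S

def greedy_mc(idx, J_init, S0, target, steps, rng, delta=0.5):
    """Greedy Monte Carlo (assumed variant): perturb one weight at a time and
    keep the move only if the training error does not increase."""
    J = J_init.copy()
    L, N, c = J.shape
    err = np.mean(forward(idx, J, S0) != target)
    history = [err]
    for _ in range(steps):
        l, i, k = rng.integers(L), rng.integers(N), rng.integers(c)
        old = J[l, i, k]
        J[l, i, k] = old + delta * rng.standard_normal()
        new_err = np.mean(forward(idx, J, S0) != target)
        if new_err <= err:
            err = new_err          # accept the move
        else:
            J[l, i, k] = old       # reject and restore the weight
        history.append(err)
    return J, history

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, L, c = 16, 4, 4
    alpha = 2                       # alpha = M / c, as in the text
    M = alpha * c
    idx, J_teacher = make_machine(N, L, c, rng)
    S0 = rng.choice([-1, 1], size=(M, N))        # random binary inputs
    target = forward(idx, J_teacher, S0)         # teacher outputs
    J0 = rng.standard_normal((L, N, c))          # student starts from random weights
    J_student, history = greedy_mc(idx, J0, S0, target, steps=2000, rng=rng)
    print("training error: start", history[0], "end", history[-1])
```

Because a sign perceptron is insensitive to the overall scale of its weights, the sketch skips any weight normalization. Comparing teacher and student weights layer by layer (for example through a normalized overlap) would be one way to probe the layer-dependent correlations mentioned in the text, but this is not done here.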
