Spatially heterogeneous learning by a deep student machine

Despite their spectacular successes, deep neural networks (DNNs) with a huge number of adjustable parameters remain largely black boxes. To shed light on the hidden layers of DNNs, we study supervised learning by a DNN of width $N$ and depth $L$ consisting of perceptrons with $c$ inputs, using a statistical mechanics approach called the teacher-student setting. We consider an ensemble of student machines that exactly reproduce $M$ sets of $N$-dimensional input/output relations provided by a teacher machine. We analyze the ensemble theoretically using a replica method (H. Yoshino (2020)) and numerically by performing greedy Monte Carlo simulations. The replica theory, which applies to high-dimensional data ($N \gg 1$), becomes exact in the 'dense limit' $N \gg c \gg 1$ and $M \gg 1$ with fixed $\alpha=M/c$. Both the theory and the simulations suggest that learning by the DNN is quite heterogeneous in the network space: configurations of the machines are more correlated within the layers closer to the input/output boundaries, while the central region remains much less correlated due to over-parametrization. Deep enough systems relax faster thanks to the less correlated central region. Remarkably, both the theory and the simulations suggest that the generalization ability of the student machines does not vanish even in the deep limit $L \gg 1$, where the system becomes strongly over-parametrized. We also consider the impact of the effective dimension $D(\leq N)$ of the data by incorporating the hidden manifold model (S. Goldt et al. (2020)) into our model. The replica theory implies that the loop corrections to the dense limit, which reflect correlations between different nodes in the network, become enhanced by decreasing either the width $N$ or the effective dimension $D$ of the data. Simulations suggest that both lead to significant improvements in generalization ability.
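For readers who want a concrete picture of the teacher-student setting, the following is a minimal sketch in Python with NumPy, not the simulation code used in this work. It assumes sign-activation perceptrons, a random connectivity shared by teacher and student, weights normalized to squared norm $c$ per perceptron, and a simplified zero-temperature greedy rule that perturbs one perceptron at a time and accepts the move only if the training error does not increase. All function names, the step size, and the small system sizes are illustrative assumptions.

```python
import numpy as np

def make_connectivity(rng, N, L, c):
    """For each of L layers and N nodes, pick c distinct input nodes (assumed random wiring)."""
    return np.array([[rng.choice(N, size=c, replace=False) for _ in range(N)]
                     for _ in range(L)])

def random_weights(rng, N, L, c):
    """Gaussian weights rescaled so each perceptron has squared norm c (assumed constraint)."""
    J = rng.standard_normal((L, N, c))
    return J * np.sqrt(c) / np.linalg.norm(J, axis=-1, keepdims=True)

def forward(idx, J, S0):
    """Propagate +/-1 inputs S0 of shape (M, N) through all layers; returns shape (M, N)."""
    S = S0
    c = J.shape[-1]
    for l in range(J.shape[0]):
        # S[:, idx[l]] gathers the c inputs of every node: shape (M, N, c)
        h = np.einsum('mic,ic->mi', S[:, idx[l]], J[l]) / np.sqrt(c)
        S = np.where(h >= 0, 1, -1)
    return S

def greedy_mc(rng, idx, J0, X, Y, sweeps, step=0.3):
    """Simplified greedy Monte Carlo: accept a move only if training error does not increase."""
    J = J0.copy()
    L, N, c = J.shape
    err = np.mean(forward(idx, J, X) != Y)
    for _ in range(sweeps * L * N):
        l, i = rng.integers(L), rng.integers(N)
        old = J[l, i].copy()
        new = old + step * rng.standard_normal(c)
        J[l, i] = new * np.sqrt(c) / np.linalg.norm(new)  # restore the norm constraint
        new_err = np.mean(forward(idx, J, X) != Y)
        if new_err <= err:
            err = new_err
        else:
            J[l, i] = old  # reject the move
    return J, err

def layer_overlap(J_a, J_b):
    """Mean normalized weight overlap per layer between two machines; shape (L,)."""
    return np.mean(np.sum(J_a * J_b, axis=-1) / J_a.shape[-1], axis=-1)

# Small illustrative sizes (far from the dense limit N >> c >> 1).
rng = np.random.default_rng(0)
N, L, c, M = 10, 3, 5, 20          # alpha = M / c = 4
idx = make_connectivity(rng, N, L, c)
J_teacher = random_weights(rng, N, L, c)
X_train = rng.choice([-1, 1], size=(M, N))
Y_train = forward(idx, J_teacher, X_train)

J_init = random_weights(rng, N, L, c)
train_err_init = np.mean(forward(idx, J_init, X_train) != Y_train)
J_student, train_err_final = greedy_mc(rng, idx, J_init, X_train, Y_train, sweeps=5)

# Generalization error: disagreement with the teacher on unseen inputs.
X_test = rng.choice([-1, 1], size=(200, N))
test_err = np.mean(forward(idx, J_student, X_test) != forward(idx, J_teacher, X_test))
print("train error:", train_err_init, "->", train_err_final)
print("test error:", test_err)
print("teacher-student overlap per layer:", layer_overlap(J_teacher, J_student))
```

In this sketch the per-layer overlap is only a crude proxy for the layer-resolved correlations discussed above; at such small sizes it should not be expected to reproduce the heterogeneous profile quantitatively.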
