Feature Learning and Generalization in Deep Networks with Orthogonal Weights

Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width -- which govern the evolution of observables during training -- saturate at a depth of $\sim 20$, rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed. We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep nonlinear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks.
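The contrast between the two initializations can be probed with a small numerical experiment. The sketch below is illustrative only and is not the authors' code: it assumes square layers of equal width, zero biases, and a weight variance of 1/width for the Gaussian case (taken here as the critical tuning for tanh). As a simple proxy for fluctuations it uses the relative spread of the last layer's mean squared preactivation across random initializations. The function names, width, depth, and trial count are all hypothetical choices.

```python
import numpy as np


def gaussian_weights(rng, n):
    # i.i.d. Gaussian entries with variance 1/n (assumed critical tuning for tanh, zero bias)
    return rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))


def orthogonal_weights(rng, n):
    # Orthogonal matrix from the QR decomposition of a Gaussian matrix;
    # multiplying each column by the sign of r's diagonal makes the draw Haar-uniform
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))


def final_kernel(rng, sampler, x, depth):
    # Forward pass through `depth` layers; returns the mean squared preactivation of the last layer
    z = sampler(rng, x.size) @ x
    for _ in range(depth - 1):
        z = sampler(rng, x.size) @ np.tanh(z)
    return float(np.mean(z ** 2))


def kernel_fluctuation(sampler, width=64, depth=32, trials=200, seed=0):
    # Relative standard deviation of the final-layer kernel over independent initializations
    rng = np.random.default_rng(seed)
    x = rng.normal(size=width)  # one fixed input, reused for every trial
    ks = np.array([final_kernel(rng, sampler, x, depth) for _ in range(trials)])
    return float(ks.std() / ks.mean())


if __name__ == "__main__":
    for name, sampler in [("gaussian", gaussian_weights), ("orthogonal", orthogonal_weights)]:
        print(name, kernel_fluctuation(sampler))
```

Repeating the run at several depths with the width held fixed gives a rough picture of how this proxy scales for each ensemble. It does not measure the NTK correlators discussed in the abstract.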
