It is unclear how changing the learning rule of a deep neural network alters its learning dynamics and representations. To gain insight into the relationship between learned features, function approximation, and the learning rule, we analyze infinite-width deep networks trained with gradient descent (GD) and with biologically plausible alternatives, including feedback alignment (FA), direct feedback alignment (DFA), and error-modulated Hebbian learning (Hebb), as well as gated linear networks (GLN). We show that, for each of these learning rules, the evolution of the output function at infinite width is governed by a time-varying effective neural tangent kernel (eNTK). In the lazy training limit, this eNTK is static, while in the rich mean-field regime the kernel's evolution can be determined self-consistently with dynamical mean field theory (DMFT). This DMFT enables comparisons of the feature and prediction dynamics induced by each of these learning rules. In the lazy limit, we find that DFA and Hebb can learn only from the last-layer features, while full FA can utilize earlier layers with a scale determined by the initial correlation between feedforward and feedback weight matrices. In the rich regime, DFA and FA utilize a temporally evolving and depth-dependent NTK. Counterintuitively, we find that FA networks trained in the rich regime exhibit more feature learning if initialized with smaller correlation between the forward and backward pass weights. GLNs admit a very simple formula for their lazy limit kernel and preserve conditional Gaussianity of their preactivations under gating functions. Error-modulated Hebb rules show very small task-relevant alignment of their kernels and perform most task-relevant learning in the last layer.
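As a rough illustration of the lazy-limit statement about FA, the following NumPy sketch computes an effective kernel at initialization for a finite-width network with a single hidden layer. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-hidden-layer tanh architecture, the 1/sqrt(width) parameterization, the way the feedback vector `b` is built from the forward readout `w2` with correlation `rho`, and the helper names `init_params` and `effective_ntk`. The kernel is split into a last-layer term, which does not depend on the feedback weights, and a first-layer term that pairs the true backward signal with the FA pseudo-gradient signal.

```python
import numpy as np


def init_params(rng, d, n, rho):
    """Draw forward weights and a feedback vector with correlation rho (assumed setup)."""
    W1 = rng.standard_normal((n, d))      # first-layer forward weights
    w2 = rng.standard_normal(n)           # forward readout weights
    w2_indep = rng.standard_normal(n)     # independent Gaussian draw
    # Feedback weights: each entry has unit variance and correlation rho with w2.
    b = rho * w2 + np.sqrt(1.0 - rho**2) * w2_indep
    return W1, w2, b


def effective_ntk(W1, w2, b, X):
    """Return (K_last, K_first) for f(x) = w2 . tanh(W1 x / sqrt(d)) / sqrt(n).

    K_last comes from training the readout and is the same for GD and FA.
    K_first pairs the true gradient w.r.t. W1 (rows) with the FA
    pseudo-gradient that uses b in place of w2 (columns).
    """
    P, d = X.shape
    n = W1.shape[0]
    H = X @ W1.T / np.sqrt(d)             # preactivations, shape (P, n)
    Phi = np.tanh(H)                      # hidden features
    dphi = 1.0 - Phi**2                   # tanh derivative
    G_true = dphi * w2                    # true backward signal per unit
    G_fa = dphi * b                       # FA pseudo-gradient signal per unit
    K_last = Phi @ Phi.T / n
    K_first = (G_true @ G_fa.T / n) * (X @ X.T / d)
    return K_last, K_first


if __name__ == "__main__":
    X = np.random.default_rng(0).standard_normal((5, 10))
    for rho in (0.0, 0.5, 1.0):
        W1, w2, b = init_params(np.random.default_rng(1), d=10, n=4000, rho=rho)
        _, K_first = effective_ntk(W1, w2, b, X)
        print(f"rho={rho:.1f}  mean diag of first-layer term: {np.diag(K_first).mean():.3f}")
```

In this toy setting, setting `rho = 1` makes `b` equal to `w2`, so the first-layer term reduces to the ordinary NTK contribution and is symmetric. For `rho = 0` the first-layer term shrinks toward zero as the width grows, which mirrors the abstract's claim that the contribution of earlier layers in lazy FA scales with the initial forward/feedback correlation.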