Forward gradient learning computes a noisy directional gradient and is a biologically plausible alternative to backprop for training deep neural networks. However, the standard forward gradient algorithm, when applied naively, suffers from high variance when the number of learnable parameters is large. In this paper, we propose a series of architectural and algorithmic modifications that together make forward gradient learning practical for standard deep learning benchmark tasks. We show that the variance of the forward gradient estimator can be substantially reduced by applying perturbations to activations rather than weights. We further improve the scalability of forward gradient learning by introducing a large number of local greedy loss functions, each of which involves only a small number of learnable parameters, and a new MLPMixer-inspired architecture, LocalMixer, that is more suitable for local learning. Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
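To make the variance argument concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy setting of a single linear layer z = W x with squared loss L = 0.5 * ||z - y||^2, for which the directional derivative can be written by hand instead of using forward-mode autodiff. All function names (`residual`, `weight_perturbed_grad`, `activation_perturbed_grad`, `compare_estimators`) and the layer sizes are hypothetical choices for illustration. Both estimators multiply a directional derivative by the random direction that produced it; they differ only in whether the direction lives in weight space or in activation space.

```python
import random

def residual(W, x, y):
    """Forward pass of a linear layer z = W x; returns r = z - y (= dL/dz)."""
    return [sum(w * xi for w, xi in zip(row, x)) - yi for row, yi in zip(W, y)]

def weight_perturbed_grad(W, x, y, rng):
    """Forward gradient estimate using a Gaussian direction V in weight space."""
    r = residual(W, x, y)
    V = [[rng.gauss(0.0, 1.0) for _ in x] for _ in W]
    # Tangent of z along V is V x, so the tangent of L is r . (V x).
    jvp = sum(ri * sum(v * xi for v, xi in zip(vrow, x)) for ri, vrow in zip(r, V))
    # Estimate of dL/dW: directional derivative times the direction.
    return [[jvp * v for v in vrow] for vrow in V]

def activation_perturbed_grad(W, x, y, rng):
    """Forward gradient estimate using a Gaussian direction u in activation space."""
    r = residual(W, x, y)
    u = [rng.gauss(0.0, 1.0) for _ in r]
    # Directional derivative of L along u, taken with respect to z.
    jvp = sum(ri * ui for ri, ui in zip(r, u))
    g_z = [jvp * ui for ui in u]  # estimate of dL/dz
    # Map the activation-gradient estimate to weights: dL/dW = (dL/dz) x^T.
    return [[gz * xi for xi in x] for gz in g_z]

def sq_error(A, B):
    """Squared Frobenius distance between two matrices stored as nested lists."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def compare_estimators(num_samples=500, seed=0, n_out=4, n_in=32):
    """Return the mean squared error of each estimator against the exact gradient."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    r = residual(W, x, y)
    exact = [[ri * xi for xi in x] for ri in r]  # exact dL/dW for this toy loss
    mse_w = mse_a = 0.0
    for _ in range(num_samples):
        mse_w += sq_error(weight_perturbed_grad(W, x, y, rng), exact)
        mse_a += sq_error(activation_perturbed_grad(W, x, y, rng), exact)
    return mse_w / num_samples, mse_a / num_samples

if __name__ == "__main__":
    mse_weight, mse_act = compare_estimators()
    print("weight perturbation MSE:    ", mse_weight)
    print("activation perturbation MSE:", mse_act)
```

In this toy setting the estimation error grows roughly with the number of perturbed dimensions, which is `n_out * n_in` for weights but only `n_out` for activations. This is the intuition behind perturbing activations. A real network would obtain the directional derivative from forward-mode automatic differentiation rather than a hand-written formula.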