Relieving the reliance of neural network training on global back-propagation (BP) has emerged as a notable research topic, owing to BP's biological implausibility and large memory consumption. Among existing solutions, local learning optimizes gradient-isolated modules of a neural network with local errors and has been shown to be effective even on large-scale datasets. However, the reconciliation among local errors has never been investigated. In this paper, we first theoretically study non-greedy layer-wise training and show that convergence cannot be assured when the local gradient in a module w.r.t. its input is not reconciled with the local gradient in the previous module w.r.t. its output. Inspired by this theoretical result, we further propose a local training strategy that successively regularizes the gradient reconciliation between neighboring modules without breaking gradient isolation or introducing any learnable parameters. Our method can be integrated into both local-BP and BP-free settings. In experiments, we achieve significant performance improvements over previous methods. In particular, on ImageNet our method attains performance competitive with global BP for both CNN and Transformer architectures while saving more than 40% of memory consumption.
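To make the reconciled quantities concrete, the following is a minimal sketch (not the paper's exact method; all names, the auxiliary-head losses, and the cosine-based score are illustrative assumptions): two gradient-isolated linear modules are trained with local MSE losses, and we compute the two local gradients the theory concerns, namely the gradient of module 2's local loss w.r.t. its input and the gradient of module 1's local loss w.r.t. its output.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 8, 6, 4

x = rng.normal(size=(d_in,))   # input sample
y = rng.normal(size=(d_out,))  # target

W1 = rng.normal(size=(d_hid, d_in)) * 0.1   # module 1 weights (hypothetical)
A1 = rng.normal(size=(d_out, d_hid)) * 0.1  # module 1 auxiliary head (hypothetical)
W2 = rng.normal(size=(d_out, d_hid)) * 0.1  # module 2 weights (hypothetical)

h = W1 @ x             # module 1 output
h_detached = h.copy()  # gradient isolation: no gradient flows back across this boundary

# Local gradient of module 1's loss ||A1 h - y||^2 w.r.t. its OUTPUT h.
g_out = 2.0 * A1.T @ (A1 @ h - y)

# Local gradient of module 2's loss ||W2 h - y||^2 w.r.t. its INPUT h.
g_in = 2.0 * W2.T @ (W2 @ h_detached - y)

# Reconciliation score between the two local gradients; a regularizer in
# the spirit of the paper would push this toward 1 without breaking isolation.
cos = g_out @ g_in / (np.linalg.norm(g_out) * np.linalg.norm(g_in))
```

With random untrained weights the two local gradients are generally misaligned (`cos` far from 1), which is exactly the situation the proposed regularization is meant to correct during training.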