Greedy layer-wise or module-wise training of neural networks is compelling in constrained and on-device settings where memory is limited, as it circumvents a number of problems of end-to-end back-propagation. However, it suffers from a stagnation problem: early layers overfit, and beyond a certain depth, adding deeper layers no longer increases test accuracy. We propose to solve this issue by introducing a module-wise regularization inspired by the minimizing movement scheme for gradient flows in distribution space. We call the method TRGL, for Transport Regularized Greedy Learning, and study it theoretically, proving that it leads to greedy modules that are regular and that progressively solve the task. Experimentally, we show that adding our regularization improves the accuracy of module-wise training for various architectures such as ResNets, Transformers and VGG. The resulting accuracy is superior to that of other module-wise training methods and often to that of end-to-end training, with as much as 60% less memory usage.
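
For orientation, the minimizing movement scheme mentioned above can be written in its standard form for Wasserstein gradient flows (often called the JKO scheme). The second display is only a sketch of how such a step might translate into a module-wise training objective. Its notation is assumed for illustration and is not taken from the abstract: $T_k$ is module $k$, $F_k$ its auxiliary classifier, $\mathcal{L}$ the task loss, $x_i^k$ the features of sample $i$ entering module $k$, and $\tau > 0$ a step size. The sketch also assumes that $T$ preserves the feature dimension, so that $T(x) - x$ is well defined.

```latex
% Standard minimizing movement (JKO) step for a functional \mathcal{F}
% over probability distributions, with W_2 the 2-Wasserstein distance:
\rho_{k+1} \in \arg\min_{\rho} \; \mathcal{F}(\rho) + \frac{1}{2\tau} W_2^2(\rho, \rho_k)

% Assumed module-wise analogue: the task loss of module k plus a penalty
% on how far the module moves its inputs (hypothetical notation):
(T_k, F_k) \in \arg\min_{T, F} \; \sum_{i} \mathcal{L}\big(F(T(x_i^k)), y_i\big)
  + \frac{1}{2\tau} \sum_{i} \big\| T(x_i^k) - x_i^k \big\|^2,
\qquad x_i^{k+1} = T_k(x_i^k)
```

Under this reading, a small $\tau$ keeps each module close to the identity map, while letting $\tau \to \infty$ removes the penalty and recovers unregularized greedy training.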