End-to-end backpropagation has several shortcomings: it requires loading the entire model into memory during training, which can be impossible in resource-constrained settings, and it suffers from three locking problems (forward locking, update locking, and backward locking) that prevent training the layers in parallel. Solving layer-wise optimization problems instead addresses these issues and has been used for on-device training of neural networks. We develop a layer-wise training method, particularly well-adapted to ResNets, inspired by the minimizing movement scheme for gradient flows in distribution space. The method amounts to a kinetic energy regularization of each block, which makes the blocks optimal transport maps and endows them with regularity. This regularization alleviates the stagnation problem observed in layer-wise training, whereby greedily trained early layers overfit and deeper layers stop improving test accuracy beyond a certain depth. We show on classification tasks that our method improves the test accuracy of block-wise trained ResNets, whether the blocks are trained sequentially or in parallel.
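To make the kinetic energy regularization concrete, the following is a minimal NumPy sketch of a per-block objective, not the authors' implementation. It assumes a residual block of the form x -> x + f(x), a local linear classifier head attached to the block, and a penalty weighted by 1/(2*tau) in the style of a minimizing movement step. The two-layer form of f, the names `residual_block`, `block_loss`, `W_head`, and the step size `tau` are all hypothetical choices made for illustration.

```python
import numpy as np

def residual_block(x, W1, W2):
    """Residual update f(x) of a block x -> x + f(x) (hypothetical two-layer MLP)."""
    return np.maximum(x @ W1, 0.0) @ W2

def kinetic_energy(update):
    """Mean squared norm of the update, i.e. the average squared displacement of the batch."""
    return float(np.mean(np.sum(update ** 2, axis=1)))

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed with a max shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

def block_loss(x, labels, W1, W2, W_head, tau):
    """Local objective for one block: task loss plus kinetic energy / (2 * tau).

    Assumption: the 1 / (2 * tau) weighting mirrors a minimizing movement step;
    the exact weighting used in the paper may differ.
    """
    update = residual_block(x, W1, W2)
    out = x + update                      # block output, fed to the next block
    task = cross_entropy(out @ W_head, labels)
    return task + kinetic_energy(update) / (2.0 * tau), out
```

In a sequential block-wise setup under these assumptions, each block would minimize its own `block_loss` on the (fixed) output `out` of the previous block. A smaller `tau` penalizes large displacements more strongly, keeping each block closer to the identity map.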