Reversible architectures have been shown to perform on par with their non-reversible counterparts and have been applied in deep learning for memory savings and generative modeling. In this work, we show how reversible architectures can solve challenges in parallelizing deep model training. We introduce PETRA, a novel alternative to backpropagation for parallelizing gradient computations. PETRA facilitates effective model parallelism by enabling stages (i.e., sets of layers) to compute independently on different devices while only needing to communicate activations and gradients with one another. By decoupling the forward and backward passes and keeping a single updated version of the parameters, PETRA also removes the need for weight stashing. We develop a custom autograd-like training framework for PETRA and demonstrate its effectiveness on CIFAR-10, ImageNet32, and ImageNet, achieving accuracies comparable to backpropagation with ResNet-18, ResNet-34, and ResNet-50 models.
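To make the notion of reversibility concrete: a reversible stage is one whose input can be recomputed exactly from its output, which is where the memory savings mentioned above come from. The sketch below illustrates this with the additive coupling scheme commonly used in reversible residual networks. The abstract does not specify the block design used in PETRA, so this choice of coupling, the residual functions `f` and `g`, and all function names are illustrative assumptions, not the authors' implementation.

```python
import math

def f(v):
    # Hypothetical residual function F; it does not need to be invertible itself.
    return [math.tanh(2.0 * x) for x in v]

def g(v):
    # Hypothetical residual function G; likewise arbitrary.
    return [0.5 * x * x for x in v]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def reversible_forward(x1, x2):
    # Additive coupling: each half of the input is updated using the other half.
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2

def reversible_inverse(y1, y2):
    # Undo the two updates in reverse order to recover the stage input.
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2

x1, x2 = [0.1, -0.4, 2.0], [1.5, 0.0, -0.7]
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)

# Largest reconstruction error across both halves (floating-point round-off only).
max_error = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print(max_error)
```

Because the inverse needs only the stage output, a stage built this way could in principle rebuild its input during the backward pass instead of keeping it in memory; how PETRA exploits this is described in the paper itself rather than in this sketch.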