Neural network training is inherently sequential: the layers complete forward propagation in succession, after which gradients (computed from a loss function) are calculated and backpropagated starting from the last layer. These sequential computations significantly slow down neural network training, especially for deeper networks. Prediction has been used successfully in many areas of computer architecture to speed up sequential processing. We therefore propose ADA-GP, which uses gradient prediction adaptively to speed up deep neural network (DNN) training while maintaining accuracy. ADA-GP incorporates a small neural network that predicts gradients for the different layers of a DNN model, and it uses a novel tensor reorganization to make predicting a large number of gradients feasible. ADA-GP alternates between DNN training with backpropagated gradients and DNN training with predicted gradients, and it adaptively adjusts when and for how long gradient prediction is used to balance accuracy and performance. Finally, we provide a detailed hardware extension to a typical DNN accelerator that realizes the speedup potential of gradient prediction. Our extensive experiments with fourteen DNN models show that ADA-GP achieves an average speedup of 1.47x with accuracy similar to or even higher than that of the baseline models. Moreover, it consumes 34% less energy on average than the baseline hardware accelerator, owing to reduced off-chip memory accesses.
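The alternation between backpropagation phases and gradient-prediction phases described in the abstract can be illustrated with a minimal scheduling sketch. The abstract does not specify the adaptation rule, so everything here is an assumption made for illustration: the `PhaseScheduler` class, the `run_schedule` helper, the doubling/halving heuristic, the `tolerance` threshold, and the phase-length bounds are all hypothetical and are not taken from ADA-GP itself.

```python
from dataclasses import dataclass


@dataclass
class PhaseScheduler:
    """Hypothetical controller alternating backprop (BP) and gradient-prediction (GP) phases."""
    bp_len: int = 4          # batches per BP phase (assumed value)
    gp_len: int = 4          # current batches per GP phase (assumed starting value)
    min_gp: int = 1          # assumed lower bound on GP phase length
    max_gp: int = 16         # assumed upper bound on GP phase length
    tolerance: float = 0.1   # assumed threshold on predictor error

    def adapt(self, predictor_error: float) -> None:
        """Assumed heuristic: lengthen GP phases when the predictor is accurate, shorten otherwise."""
        if predictor_error <= self.tolerance:
            self.gp_len = min(self.gp_len * 2, self.max_gp)
        else:
            self.gp_len = max(self.gp_len // 2, self.min_gp)


def run_schedule(scheduler: PhaseScheduler, errors_per_round: list[float]) -> list[str]:
    """Return the per-batch mode sequence given one predictor-error reading per round."""
    modes: list[str] = []
    for err in errors_per_round:
        # BP phase: true gradients update the model; the predictor would be trained here.
        modes += ["BP"] * scheduler.bp_len
        # The error observed during the BP phase sets the length of the next GP phase.
        scheduler.adapt(err)
        # GP phase: the model would be updated with predicted gradients, skipping backprop.
        modes += ["GP"] * scheduler.gp_len
    return modes


if __name__ == "__main__":
    sched = PhaseScheduler(bp_len=2, gp_len=2)
    print(run_schedule(sched, [0.05, 0.5]))
```

In this sketch, a low predictor error in the first round doubles the gradient-prediction phase, and a high error in the second round halves it again. A real implementation would replace the error readings with a measured discrepancy between predicted and backpropagated gradients.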