Neural network training is inherently sequential: the layers complete forward propagation in succession, after which gradients (derived from a loss function) are computed and backpropagated starting from the last layer. These sequential computations significantly slow down neural network training, especially for deeper networks. Prediction has been used successfully in many areas of computer architecture to speed up sequential processing. We therefore propose ADA-GP, which uses gradient prediction adaptively to speed up deep neural network (DNN) training while maintaining accuracy. ADA-GP incorporates a small neural network that predicts gradients for the different layers of a DNN model, and it uses a novel tensor reorganization method to make predicting a large number of gradients feasible. ADA-GP alternates between DNN training using backpropagated gradients and DNN training using predicted gradients, and it adaptively adjusts when and for how long gradient prediction is used to strike a balance between accuracy and performance. Last but not least, we provide a detailed hardware extension to a typical DNN accelerator to realize the speedup potential of gradient prediction. Our extensive experiments with fifteen DNN models show that ADA-GP achieves an average speedup of 1.47× with similar or even higher accuracy than the baseline models. Moreover, it consumes, on average, 34% less energy than the baseline accelerator due to reduced off-chip memory accesses.
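The alternation between backpropagated and predicted gradients can be illustrated with a deliberately tiny sketch. It is not the ADA-GP implementation: the one-weight model, the two-parameter linear predictor, the function name `train_with_gradient_prediction`, all of its parameters, and the window-doubling adaptation rule are assumptions made for illustration. Tensor reorganization and the hardware extension are not modeled.

```python
def train_with_gradient_prediction(steps=400, warmup=50, bp_len=4,
                                   max_gp_len=8, tol=0.05, lr=0.1):
    """Toy loop alternating backpropagated (BP) and predicted (GP) gradients."""
    x, target = 1.0, 3.0      # fixed input and label; loss = 0.5 * (w*x - target)**2
    w = 0.0                   # the single trainable "layer" weight
    a, b = 0.0, 0.0           # tiny predictor: g_hat = a * out + b
    pred_lr = 0.1             # learning rate of the predictor itself
    gp_len = 1                # current length of a GP window (adapted below)
    phase, left = "BP", warmup
    last_err = float("inf")   # predictor error seen on the latest BP step
    predicted_steps = 0

    for _ in range(steps):
        out = w * x                     # forward pass
        g_hat = a * out + b             # predicted gradient dL/dw

        if phase == "GP":
            w -= lr * g_hat             # update without running backprop
            predicted_steps += 1
        else:
            g = (out - target) * x      # true gradient from backprop
            w -= lr * g
            err = g_hat - g             # train the predictor on the true gradient
            a -= pred_lr * err * out
            b -= pred_lr * err
            last_err = abs(err)

        left -= 1
        if left == 0:
            if phase == "BP":
                # Hypothetical adaptation rule: trust an accurate predictor longer.
                gp_len = min(max_gp_len, gp_len * 2) if last_err < tol else 1
                phase, left = "GP", gp_len
            else:
                phase, left = "BP", bp_len

    return w, predicted_steps


w, predicted_steps = train_with_gradient_prediction()
print(f"w = {w:.4f}, steps that skipped backprop: {predicted_steps}")
```

In this sketch the predictor is trained only during BP steps, where the true gradient is available, and the length of the next prediction window depends on how well the predictor matched the most recent true gradient. That is one simple way to trade accuracy against skipped backward passes. The abstract does not specify the actual adaptation policy.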