Sequential models, such as Recurrent Neural Networks and Neural Ordinary Differential Equations, have long suffered from slow training due to their inherently sequential nature. This bottleneck has persisted for many years, largely because sequential models were widely believed to be impossible to parallelize. We challenge this long-held belief with a parallel algorithm that accelerates GPU evaluation of sequential models by up to 3 orders of magnitude without compromising output accuracy. The algorithm requires no special structure in the model architecture, making it applicable to a wide range of architectures. With our method, training sequential models can be more than 10 times faster than with the common sequential method, without any meaningful difference in the training results. Leveraging this accelerated training, we found the Gated Recurrent Unit to be effective on a long time series classification problem with 17k time samples. By overcoming the training bottleneck, our work serves as a first step toward unlocking the potential of non-linear sequential models for long sequence problems.