Parallelization techniques have become ubiquitous for accelerating inference and training of deep neural networks. Despite this, several operations are still performed sequentially. For instance, the forward and backward passes are executed layer by layer, and the output of a diffusion model is produced by applying a sequence of denoising steps. The computational cost of this sequential approach is proportional to the number of steps involved, which becomes a potential bottleneck as the number of steps increases. In this work, we introduce DeepPCR, a novel algorithm that parallelizes typically sequential operations used in inference and training of neural networks. DeepPCR is based on interpreting a sequence of $L$ steps as the solution of a specific system of equations, which we recover using the Parallel Cyclic Reduction algorithm. This reduces the complexity of computing the sequential operations from $\mathcal{O}(L)$ to $\mathcal{O}(\log_2 L)$, thus yielding a speedup for large $L$. To verify the lower theoretical complexity of the algorithm, and to identify the regimes in which it yields a speedup, we test the effectiveness of DeepPCR in parallelizing the forward and backward passes in multi-layer perceptrons, and reach speedups of up to $30\times$ for the forward pass and $200\times$ for the backward pass. We additionally showcase the flexibility of DeepPCR by parallelizing the training of ResNets with as many as 1024 layers and generation in diffusion models, enabling up to $7\times$ faster training and $11\times$ faster generation, respectively, when compared to the sequential approach.
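To make the core idea concrete, the following is a minimal sketch, not the DeepPCR implementation itself. It assumes every step is affine, $z_l = A_l z_{l-1} + b_l$, so that the $L$ steps form a block-bidiagonal linear system, and it applies a Parallel Cyclic Reduction style elimination to that system. The function names `sequential_solve` and `pcr_solve` are hypothetical, and how nonlinear layers are handled is not covered here. Each reduction round is a single batched operation over all steps, so the number of sequential rounds is $\lceil \log_2 L \rceil$, provided enough parallel workers are available.

```python
import numpy as np


def sequential_solve(A, b, z0):
    """Reference: apply z_l = A_l z_{l-1} + b_l one step at a time (L rounds)."""
    z, out = z0, []
    for A_l, b_l in zip(A, b):
        z = A_l @ z + b_l
        out.append(z)
    return np.stack(out)


def pcr_solve(A, b, z0):
    """Recover all z_l in ceil(log2(L)) rounds of batched eliminations.

    A has shape (L, d, d), b has shape (L, d), z0 has shape (d,).
    Returns (z, rounds), where z has shape (L, d).
    """
    A, b = A.copy(), b.copy()
    L = A.shape[0]

    # Fold the known initial state into the first equation, so that
    # the first equation no longer depends on any unknown.
    b[0] = A[0] @ z0 + b[0]
    A[0] = 0.0

    stride, rounds = 1, 0
    while stride < L:
        # Equation l currently reads z_l = A_l z_{l-stride} + b_l.
        # Substituting equation (l - stride) into it doubles the stride:
        #   z_l = (A_l A_{l-stride}) z_{l-2*stride} + (A_l b_{l-stride} + b_l)
        # All substitutions are independent, hence one batched update.
        new_b = b.copy()
        new_A = A.copy()
        new_b[stride:] = np.einsum("lij,lj->li", A[stride:], b[:-stride]) + b[stride:]
        new_A[stride:] = A[stride:] @ A[:-stride]
        A, b = new_A, new_b
        stride *= 2
        rounds += 1

    # Every A_l is now zero, so b holds the solution z_1, ..., z_L.
    return b, rounds


# Small check on random data (sizes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
L, d = 8, 3
A = 0.5 * rng.standard_normal((L, d, d))
b = rng.standard_normal((L, d))
z0 = rng.standard_normal(d)

z_pcr, rounds = pcr_solve(A, b, z0)
assert np.allclose(z_pcr, sequential_solve(A, b, z0))
assert rounds == 3  # log2(8) rounds instead of 8 sequential steps
```

Note that the total arithmetic work in this sketch is larger than in the sequential loop, since every round touches all equations. The benefit is the shorter chain of dependent operations, which only pays off when the hardware can run each round in parallel.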