Training large deep learning models requires parallelization techniques to scale. In existing methods such as Data Parallelism or ZeRO-DP, micro-batches of data are processed in parallel, which creates two drawbacks: the total memory required to store the model's activations peaks at the end of the forward pass, and gradients must be averaged simultaneously at the end of the backpropagation step. We propose Cyclic Data Parallelism, a novel paradigm that shifts the execution of the micro-batches from simultaneous to sequential, with a uniform delay. At the cost of a slight gradient delay, the total memory taken by activations is constant, and the gradient communications are balanced throughout the training step. With Model Parallelism, our technique reduces the number of GPUs needed by sharing GPUs across micro-batches. Within the ZeRO-DP framework, our technique allows the model states to be communicated with point-to-point operations rather than a collective broadcast operation. We illustrate the strength of our approach on the CIFAR-10 and ImageNet datasets.
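The memory claim can be illustrated with a toy simulation. The sketch below is not the authors' implementation; it assumes that each layer takes one time step in both the forward and the backward pass, that each layer stores one unit of activations, and that the uniform delay between workers is `2 * num_layers / num_workers` steps. The function names `activation_profile` and `total_memory` are hypothetical and introduced only for this illustration.

```python
def activation_profile(num_layers):
    """Units of activations one micro-batch holds at each step of its cycle.

    Toy assumption: one layer per time step, one unit of memory per layer.
    """
    forward = list(range(1, num_layers + 1))   # one more layer stored per forward step
    backward = list(range(num_layers, 0, -1))  # one layer freed after each backward step
    return forward + backward


def total_memory(num_layers, num_workers, delay):
    """Total activations held across all workers at each step of a steady-state cycle.

    Worker w starts its cycle `w * delay` steps after worker 0.
    delay=0 models simultaneous execution (standard Data Parallelism).
    """
    profile = activation_profile(num_layers)
    period = len(profile)
    return [
        sum(profile[(t - w * delay) % period] for w in range(num_workers))
        for t in range(period)
    ]


# Simultaneous execution: every worker reaches the end of the forward pass together.
simultaneous = total_memory(num_layers=4, num_workers=4, delay=0)

# Cyclic execution: assumed uniform delay of 2 * num_layers / num_workers = 2 steps.
cyclic = total_memory(num_layers=4, num_workers=4, delay=2)

print("simultaneous:", simultaneous, "peak =", max(simultaneous))
print("cyclic:      ", cyclic, "peak =", max(cyclic))
```

Under these assumptions, the simultaneous schedule peaks at 16 units at the end of the forward pass, while the cyclic schedule stays at 10 units at every step. The gap depends on the toy cost model and is not a result reported in the paper.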