In the evolving landscape of neural network models, one prominent challenge stands out: the significant memory overhead of training large models. To address this challenge, this study examines Rotated Tensor Parallelism (RTP), an approach that focuses on memory deduplication in distributed training environments. Its distinguishing features include a customized communication primitive and Flyweight Pattern initialization. RTP also overlaps partition computation with partition weight communication, which keeps communication from stalling training. Our empirical evaluations show that RTP's memory consumption during distributed training is close to the optimum, in which the memory overhead of a single machine is divided evenly among multiple machines. The experimental results demonstrate that RTP achieves performance comparable to Distributed Data Parallel while supporting significantly larger models, with near-linear scalability in terms of memory. The RTP code is available at https://github.com/wdlctc/rtp.
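To make the idea of rotating weight partitions concrete, the following is a minimal single-process sketch, not the implementation in the RTP repository. It assumes a weight matrix split by rows into one shard per worker, with shards passed around a ring so that every worker applies every shard to its own input while storing only one shard at a time. The names `matvec`, `rotated_forward`, and `per_worker_weight_elements` are hypothetical and chosen for illustration only.

```python
# Toy simulation of ring-rotated weight shards (illustrative assumption,
# not the RTP codebase). Workers are simulated as list indices.

def matvec(rows, x):
    """Multiply a list of weight rows by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def rotated_forward(weight, inputs):
    """Compute weight @ inputs[i] for every worker i using rotating shards."""
    n = len(inputs)
    assert len(weight) % n == 0, "rows must split evenly across workers"
    size = len(weight) // n
    shards = [weight[k * size:(k + 1) * size] for k in range(n)]
    held = list(range(n))  # held[i] is the shard index worker i currently stores
    outputs = [[None] * len(weight) for _ in range(n)]
    for _ in range(n):
        for worker in range(n):
            k = held[worker]
            # Each worker fills in the output rows that match its current shard.
            outputs[worker][k * size:(k + 1) * size] = matvec(shards[k], inputs[worker])
        # Rotate: each worker receives the shard held by its ring predecessor.
        held = [held[(worker - 1) % n] for worker in range(n)]
    return outputs


def per_worker_weight_elements(weight, n):
    """Weight elements stored per worker when the matrix is split n ways."""
    return len(weight) * len(weight[0]) // n


weight = [[1, 2], [3, 4], [5, 6], [7, 8]]
inputs = [[1, 0], [0, 1]]  # one input vector per simulated worker
result = rotated_forward(weight, inputs)
# Every worker ends with the same output it would get from the full matrix.
assert result == [matvec(weight, x) for x in inputs]
print(result, per_worker_weight_elements(weight, len(inputs)))
```

In this sketch each worker stores 1/N of the weight elements, which mirrors the even division of single-machine memory described in the abstract. A real system that overlaps communication with computation would likely also need a receive buffer for the incoming shard, which this toy version does not model.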