The success of Transformer models has pushed the scale of deep learning models to billions of parameters. Because the memory of a single GPU is limited, such models must be trained in parallel across multiple devices. However, best practices for choosing the optimal parallel strategy are still lacking, since doing so requires domain expertise in both deep learning and parallel computing. The Colossal-AI system addresses this challenge by introducing a unified interface for scaling sequential model-training code to distributed environments. It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism, as well as heterogeneous training methods integrated with the zero redundancy optimizer. Compared to the baseline system, Colossal-AI can achieve up to a 2.76 times training speedup on large-scale models.