AI accelerator processing capabilities and memory constraints largely dictate the scale at which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state-of-the-art, transformer-based model today requires GPU-accelerated high-performance computers with high-speed interconnects. As datasets and models continue to grow in size, so do the computational and memory demands of AI. These challenges have inspired the development of distributed-algorithm and circuit-based optimization techniques that make it possible to progressively scale models in multi-node environments, minimize neural network cost functions efficiently for faster convergence, and fit more parameters into a fixed amount of available resources. In our research project, we focus on parallel and distributed machine learning algorithm development, specifically on optimizing the data processing and pre-training of a set of five encoder-decoder LLMs, ranging from 580 million to 13 billion parameters. We performed a fine-grained study to quantify the relationships among three ML parallelism methods, with a particular focus on the stages of the Microsoft DeepSpeed Zero Redundancy Optimizer (ZeRO).
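To make the memory argument concrete, the sketch below estimates per-GPU memory for model states under each ZeRO stage. It is not the measurement method of this study. It assumes the commonly cited accounting for mixed-precision Adam training (2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer states), and it ignores activations, temporary buffers, and fragmentation. The function name `zero_model_state_bytes` and the GPU count of 64 are hypothetical, chosen only for illustration; the two model sizes are the endpoints of the range studied here.

```python
# Rough per-GPU memory estimate for model states under ZeRO stages 0-3.
# Assumption: mixed-precision Adam, with per-parameter costs of
#   2 bytes (fp16 weights) + 2 bytes (fp16 gradients)
#   + 12 bytes (fp32 weights, momentum, and variance).
# Activations, communication buffers, and fragmentation are NOT included.

BYTES_PARAMS = 2   # fp16 weights
BYTES_GRADS = 2    # fp16 gradients
BYTES_OPTIM = 12   # fp32 master weights + Adam momentum + Adam variance


def zero_model_state_bytes(num_params: float, num_gpus: int, stage: int) -> float:
    """Return estimated bytes of model state held by each GPU.

    stage 0: nothing partitioned (plain data parallelism)
    stage 1: optimizer states partitioned across GPUs
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and weights partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = BYTES_PARAMS * num_params
    grads = BYTES_GRADS * num_params
    optim = BYTES_OPTIM * num_params

    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    num_gpus = 64  # hypothetical cluster size, for illustration only
    for label, n in (("580M", 580e6), ("13B", 13e9)):
        for stage in range(4):
            gib = zero_model_state_bytes(n, num_gpus, stage) / 2**30
            print(f"{label:>5}  ZeRO stage {stage}: {gib:8.2f} GiB per GPU")
```

Under these assumptions, each higher stage partitions one more component of the model state across the data-parallel GPUs, which is why the stage choice interacts so strongly with how many parameters fit on a fixed set of devices.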