Transformers and LLMs have seen rapid adoption across all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, training transformers is slow, often taking on the order of weeks or months. Thanks to 3D parallelism (data, pipeline, and tensor-level parallelism), training can scale to a large number of GPUs, which reduces its duration but dramatically increases its cost. Even when a large number of GPUs is available, the aggregated GPU memory is often not enough to hold the full training state (optimizer state, model parameters, and gradients). To compensate, state-of-the-art approaches offload at least part of the optimizer state to host memory and perform hybrid CPU-GPU computations. Such flexible solutions dramatically reduce GPU memory utilization, which makes it feasible to run the training on a smaller number of GPUs at the cost of a performance penalty. Unfortunately, the challenges and bottlenecks of adopting this strategy are insufficiently studied by state-of-the-art approaches, which results in poor management of the combined host-GPU memory and poor overlap between data movements and computations. In this paper, we aim to fill this gap by characterizing the behavior of offloaded training using the DeepSpeed runtime. Specifically, we study the GPU memory utilization over time during each iteration, the activity on the PCIe link related to transfers between host memory and GPU memory, and the relationship between resource utilization and the steps involved in each iteration. Thanks to this study, we reveal opportunities for future improvements of offloading solutions, which enable greater flexibility to optimize the cost-performance trade-off in the context of transformer and LLM training.