Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Transformers and large language models~(LLMs) have seen rapid adoption across many domains. Their sizes have grown to hundreds of billions of parameters and continue to increase. As a result, training transformers is very expensive and often hits a ``memory wall'': even when 3D parallelism (pipeline, tensor, data) aggregates the memory of many GPUs, the combined capacity is still insufficient to hold the necessary data structures (model parameters, optimizer state, gradients, activations). To compensate, state-of-the-art approaches offload the optimizer state, at least partially, to host memory and perform hybrid CPU-GPU computations. However, their management of the combined host-GPU memory is often suboptimal and overlaps data movement poorly with computation, which leaves the interconnect bandwidth and the computational capabilities of CPUs and GPUs underutilized. In this paper, we leverage a key observation: the interleaving of the forward, backward, and update phases causes GPU memory utilization to fluctuate, and these fluctuations can be exploited to dynamically move part of the optimizer state between host and GPU memory at each iteration. To this end, we design and implement Deep Optimizer States, a novel technique that splits the LLM into subgroups and schedules the update phase of each subgroup on either the CPU or the GPU according to our proposed performance model, which addresses the trade-off between data movement cost, the acceleration of updates on GPUs relative to CPUs, and competition for shared resources. We integrate our approach with DeepSpeed and demonstrate 2.5$\times$ faster iterations over state-of-the-art approaches using extensive experiments.
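
To make the scheduling trade-off concrete, the following is a minimal sketch of a toy cost model that places each subgroup's update on the CPU or the GPU. It is not the performance model proposed in the paper: the function name `schedule_updates`, the greedy earliest-finish rule, the throughput parameters, and the figure of 12 bytes of optimizer state per parameter (assumed here for FP32 Adam: master weights, momentum, variance) are all illustrative assumptions, and competition for shared resources is ignored.

```python
def schedule_updates(subgroup_sizes, cpu_params_per_s, gpu_params_per_s,
                     pcie_bytes_per_s, state_bytes_per_param=12):
    """Greedily assign each subgroup update to the device that finishes it first.

    Hypothetical toy model: the CPU updates optimizer state in place in host
    memory, while the GPU must first fetch the state over the interconnect and
    write it back afterwards. Both devices are assumed to work in parallel.
    """
    cpu_free = 0.0  # time at which the CPU becomes idle
    gpu_free = 0.0  # time at which the GPU (including its transfers) becomes idle
    plan = []
    for n in subgroup_sizes:
        # CPU path: no data movement, slower arithmetic.
        cpu_done = cpu_free + n / cpu_params_per_s
        # GPU path: host-to-device plus device-to-host transfer, faster arithmetic.
        transfer = 2 * n * state_bytes_per_param / pcie_bytes_per_s
        gpu_done = gpu_free + transfer + n / gpu_params_per_s
        if gpu_done < cpu_done:
            plan.append("gpu")
            gpu_free = gpu_done
        else:
            plan.append("cpu")
            cpu_free = cpu_done
    # The update phase ends when the slower of the two devices finishes.
    return plan, max(cpu_free, gpu_free)


if __name__ == "__main__":
    # Four equal subgroups; all rates are made-up values for illustration only.
    plan, makespan = schedule_updates([100] * 4, cpu_params_per_s=100,
                                      gpu_params_per_s=1000,
                                      pcie_bytes_per_s=12_000)
    print(plan, makespan)
```

With these made-up rates, the sketch sends some subgroups to the GPU and keeps the rest on the CPU, so the update phase finishes sooner than a CPU-only schedule. When the interconnect is slow enough, the transfer cost dominates and every subgroup stays on the CPU.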
