The widespread adoption of language models (LMs) has driven a surge in demand for GPUs. Training large LMs requires tens of thousands of GPUs, and housing them in the same datacenter (DC) is challenging due to many constraints, including the availability of peak power. We focus on training such models across multiple DCs connected via the Wide-Area Network (WAN). We built Atlas, which speeds up training using novel workload-aware temporal bandwidth sharing and other design choices. While Atlas improves training time, it does not completely eliminate bubbles (idle GPU cycles). We built BubbleTea, which runs prefill-as-a-service (a part of LM inference) during the bubbles, thus improving GPU utilization without any impact on training. Compared to state-of-the-art designs, Atlas and BubbleTea together achieve up to 17x faster training and up to 94% GPU utilization. The code will be open-sourced.