The widespread adoption of language models (LMs) across multiple industries has caused a huge surge in demand for GPUs. Training LMs requires tens of thousands of GPUs, and housing them all in the same datacenter (DC) is becoming challenging. We focus on training such models across multiple DCs connected via a Wide-Area Network (WAN). We build ATLAS, which speeds up training using novel temporal bandwidth sharing and several other design choices. While ATLAS improves training time, it does not eliminate bubbles (idle GPU cycles). We build BUBBLETEA, which runs prefill-as-a-service (part of LM inference) during the bubbles, improving GPU utilization substantially without any impact on training. Together, ATLAS and BUBBLETEA improve training time by up to 17X and achieve GPU utilization of up to 94%.