This paper challenges the well-established paradigm of building any-to-any networks for training Large Language Models (LLMs). We show that LLMs exhibit a unique communication pattern in which only small groups of GPUs require high-bandwidth communication among themselves to achieve near-optimal training performance. Across these groups, the communication is insignificant and homogeneous. We propose a new network architecture that matches the communication requirements of LLMs. Our architecture partitions the cluster into sets of GPUs, which we call HB domains, each interconnected by a non-blocking any-to-any high-bandwidth interconnect. Across the HB domains, the network connects only GPUs with non-zero communication demands. To evaluate our proposal, we develop an analytical formulation of the training iteration time. Our formulation estimates hardware floating-point utilization to within 0.15% of the ground truth established in prior studies for larger models. We show that our proposed architecture reduces network cost by 37% to 75% compared to state-of-the-art any-to-any Clos networks without compromising LLM training performance.
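As a rough illustration of the connectivity idea described above, the following sketch counts which GPU pairs would need a network path when the cluster is partitioned into HB domains. It is not the paper's cost model. The function names are hypothetical, and the traffic pattern is an assumption made only for this example: cross-domain demand is taken to be non-zero only between GPUs that hold the same local index within their respective HB domains.

```python
from itertools import combinations


def required_pairs(num_gpus: int, hb_size: int) -> set[tuple[int, int]]:
    """Return GPU pairs that need connectivity under the assumed pattern.

    Assumption (not stated in the abstract): cross-domain traffic is
    non-zero only between GPUs with the same local index in their domain.
    """
    pairs = set()
    for a, b in combinations(range(num_gpus), 2):
        same_domain = a // hb_size == b // hb_size   # inside one HB domain
        same_local_index = a % hb_size == b % hb_size  # assumed cross-domain demand
        if same_domain or same_local_index:
            pairs.add((a, b))
    return pairs


def any_to_any_pairs(num_gpus: int) -> int:
    """Number of GPU pairs an any-to-any network must be able to serve."""
    return num_gpus * (num_gpus - 1) // 2


if __name__ == "__main__":
    n, k = 16, 4  # illustrative sizes, not taken from the paper
    print(len(required_pairs(n, k)), "of", any_to_any_pairs(n), "pairs need a path")
```

The pair count only conveys how sparse the required connectivity can be under this assumption; translating it into switch and link cost would require the hardware details that the full paper evaluates.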