This paper challenges the well-established paradigm of building any-to-any networks for training Large Language Models (LLMs). We show that LLMs exhibit a unique communication pattern in which only small groups of GPUs require high-bandwidth any-to-any communication among themselves to achieve near-optimal training performance. Across these groups, communication is insignificant, sparse, and homogeneous. We propose a new network architecture that closely matches the communication requirements of LLMs. Our architecture partitions the cluster into sets of GPUs, called HB domains, each interconnected with non-blocking any-to-any high-bandwidth interconnects. Across the HB domains, the network connects only GPUs that have communication demands. We call this a "rail-only" connection and show that the proposed architecture reduces network cost by up to 75% compared to state-of-the-art any-to-any Clos networks without compromising LLM training performance.
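To make the connectivity idea concrete, the following minimal Python sketch enumerates which GPU pairs have a direct path under a simplified model of the rail-only design. It rests on an assumption not stated in the abstract: that the GPUs with communication demands across HB domains are those sharing the same local index (the same "rail") within their respective domains. The function names `rail_only_pairs` and `any_to_any_pairs` and the sizes used are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def rail_only_pairs(num_domains, domain_size):
    """Return GPU pairs with a direct path in a simplified rail-only model.

    A GPU is identified as (domain, rank). ASSUMPTION: two GPUs in different
    HB domains are connected only if they share the same local rank (rail).
    """
    gpus = [(d, r) for d in range(num_domains) for r in range(domain_size)]
    pairs = set()
    for a, b in combinations(gpus, 2):
        same_domain = a[0] == b[0]  # any-to-any inside an HB domain
        same_rail = a[1] == b[1]    # rail link across HB domains
        if same_domain or same_rail:
            pairs.add((a, b))
    return pairs


def any_to_any_pairs(num_domains, domain_size):
    """Count GPU pairs in a full any-to-any network of the same size."""
    n = num_domains * domain_size
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Hypothetical cluster: 4 HB domains of 8 GPUs each.
    print(len(rail_only_pairs(4, 8)), "of", any_to_any_pairs(4, 8), "pairs")
```

The count of directly connected pairs is only a rough proxy for sparsity. It does not model switches, link bandwidth, or cost, so it should not be read as reproducing the cost reduction reported above.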