This paper challenges the well-established paradigm of building any-to-any networks for training Large Language Models (LLMs). We show that LLMs exhibit a unique communication pattern in which only small groups of GPUs require high-bandwidth any-to-any communication among themselves to achieve near-optimal training performance. Across these groups, communication is insignificant, sparse, and homogeneous. We propose a new network architecture that closely matches the communication requirements of LLMs. Our architecture partitions the cluster into sets of GPUs, which we call HB domains, each interconnected with non-blocking any-to-any high-bandwidth interconnects. Across HB domains, the network connects only GPUs that have communication demands. We call this a "rail-only" connection and show that the proposed architecture reduces network cost by up to 75% compared to state-of-the-art any-to-any Clos networks without compromising LLM training performance.
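The connectivity described above can be illustrated with a minimal sketch. It assumes that the "GPUs with communication demands" across HB domains are the GPUs sharing the same local rank in each domain (one "rail" per rank); the abstract does not spell this out, so treat it as an assumption. The function names `rail_only_pairs` and `any_to_any_pairs` are hypothetical, and the sketch counts directly connected GPU pairs only. It is not the cost model behind the 75% figure.

```python
from itertools import combinations


def rail_only_pairs(num_domains: int, domain_size: int) -> set[tuple[int, int]]:
    """GPU pairs with direct connectivity under the assumed rail-only layout.

    GPUs are numbered globally as domain * domain_size + local_rank.
    """
    pairs: set[tuple[int, int]] = set()

    # Inside each HB domain: non-blocking any-to-any connectivity.
    for d in range(num_domains):
        members = [d * domain_size + r for r in range(domain_size)]
        pairs.update(combinations(members, 2))

    # Across HB domains (assumption): only GPUs with the same local rank
    # are connected, forming one rail per rank.
    for r in range(domain_size):
        rail = [d * domain_size + r for d in range(num_domains)]
        pairs.update(combinations(rail, 2))

    return pairs


def any_to_any_pairs(num_gpus: int) -> int:
    """Number of GPU pairs in a full any-to-any network."""
    return num_gpus * (num_gpus - 1) // 2


if __name__ == "__main__":
    connected = rail_only_pairs(num_domains=4, domain_size=8)
    print(len(connected), "of", any_to_any_pairs(32), "GPU pairs are directly connected")
```

In this toy configuration, GPU 0 and GPU 8 (rank 0 in two different domains) are connected, while GPU 0 and GPU 9 (different ranks in different domains) are not.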