The emergence of Large Language Model (LLM) technologies has led to a rapidly growing demand for compute resources. In response, enterprises are building large-scale multi-tenant GPU clusters with 10k or even more GPUs. Alongside the rapid growth in cluster size, cluster bandwidth has also been increasing to meet communication demands, with 800 Gbps optical modules already in practical use and 1.6 Tbps modules on the horizon. However, designing clusters that simultaneously meet the requirements of large scale and high bandwidth is challenging due to the limited capacity of electrical switch chips. Unlike electrical switch chips, the single-port bandwidth of a MEMS-OCS is determined solely by the optical module, making it straightforward to meet both bandwidth and scalability requirements. In this paper, we propose an opto-electronic hybrid architecture called \textbf{LumosCore}. We address the issues of L2 protocol incompatibility, potential network contention, and algorithm time complexity through physical and logical topology design. Additionally, we design a polynomial-time link reconfiguration algorithm to reconfigure the MEMS-OCS with minimal time overhead. We validate the feasibility of the proposed scheme in a cluster consisting of 128 NPUs and, through simulations based on real traces, demonstrate the superiority of \textbf{LumosCore} over traditional architectures.