Mixture-of-experts (MoE) architectures have turned LLM serving into a cluster-scale workload in which communication accounts for a considerable portion of runtime. This has prompted industry to invest heavily in expensive, high-bandwidth scale-up networks. We question whether such costly infrastructure is strictly necessary. We present the first systematic cross-layer analysis of network cost-effectiveness for MoE LLM serving, comparing four representative XPU (e.g., GPU/TPU) topologies: scale-up, scale-out, 3D torus, and 3D full-mesh. We find that the lower-cost switchless topologies are more cost-effective than the scale-up topology in every serving scenario we explore, improving cost-effectiveness by 20.6% to 56.2%. In particular, the 3D full-mesh topology is Pareto-optimal in the performance-cost tradeoff. We also find that current scale-up link bandwidths are over-provisioned: reducing link bandwidth improves throughput per cost by up to 27%. A forward-looking analysis of upcoming GPU generations indicates that the cost-performance advantage of switchless networks will likely persist.