This paper proposes ScalePool, a novel cluster architecture designed to interconnect numerous accelerators using unified hardware interconnects rather than traditional long-distance networking. ScalePool integrates Accelerator-Centric Links (XLink) and Compute Express Link (CXL) into a unified XLink-CXL hybrid fabric. Specifically, ScalePool employs XLink for low-latency intra-cluster accelerator communication, while using hierarchical CXL-based switching fabrics for scalable and coherent inter-cluster memory sharing. By abstracting interfaces through CXL, ScalePool structurally resolves interoperability constraints, enabling heterogeneous cluster operation and composable resource disaggregation. In addition, ScalePool introduces explicit memory tiering: the latency-critical tier-1 combines accelerator-local memory with coherence-centric CXL and XLink, whereas the high-capacity tier-2 employs dedicated memory nodes interconnected by a CXL-based fabric, achieving scalable and efficient memory pooling. Evaluation results show that ScalePool accelerates LLM training by 1.22x on average and by up to 1.84x compared to conventional RDMA-based environments. Furthermore, the proposed tier-2 memory disaggregation strategy reduces latency by up to 4.5x for memory-intensive workloads.