Emerging interconnects, such as CXL and NVLink, have been integrated into the intra-host topology to scale to more accelerators, such as GPUs, and to enable efficient communication between them. To keep pace with the growing compute throughput of accelerators, interconnect link bandwidth has increased substantially, e.g., 256 GB/s for CXL 3.0 links, which surpasses Ethernet and InfiniBand network links by an order of magnitude or more. Consequently, when data-intensive jobs, such as LLM training, scale across multiple hosts beyond the reach of the interconnect, performance is significantly hindered by the limited bandwidth of the network infrastructure. We address this problem by proposing DFabric, a two-tier interconnect architecture. First, DFabric disaggregates a rack's computing units with an interconnect fabric, i.e., a CXL fabric, which scales at the rack level, so that the units benefit from efficient intra-rack interconnection. Second, DFabric disaggregates NICs from hosts and consolidates them into a NIC pool over the CXL fabric. By providing aggregate capacity comparable to the interconnect bandwidth, the NIC pool enables efficient communication across racks or beyond the reach of the interconnect fabric. However, once each host can use the NIC pool efficiently, local memory access becomes the bottleneck. To this end, DFabric builds a memory pool with sufficient bandwidth by disaggregating host local memory and adding more memory devices. We have implemented a DFabric prototype that runs applications transparently, and we validated its performance gains with various microbenchmarks and compute-intensive applications such as DNN and graph workloads.
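To make the bandwidth gap concrete, the following back-of-envelope sketch estimates how many pooled NICs would be needed to match a single interconnect link. The 256 GB/s figure comes from the abstract; the 200 Gb/s per-NIC line rate and the helper name `nics_to_match` are illustrative assumptions, not values or interfaces taken from DFabric.

```python
import math


def nics_to_match(interconnect_gbytes_per_s: float, nic_gbits_per_s: float) -> int:
    """Return the minimum number of NICs whose aggregate line rate
    reaches the given interconnect bandwidth (hypothetical helper)."""
    nic_gbytes_per_s = nic_gbits_per_s / 8.0  # convert Gb/s to GB/s
    return math.ceil(interconnect_gbytes_per_s / nic_gbytes_per_s)


CXL_LINK_GBYTES_PER_S = 256.0    # CXL 3.0 link figure quoted in the abstract
ASSUMED_NIC_GBITS_PER_S = 200.0  # assumption: one 200 Gb/s Ethernet/InfiniBand NIC

if __name__ == "__main__":
    gap = CXL_LINK_GBYTES_PER_S / (ASSUMED_NIC_GBITS_PER_S / 8.0)
    pool = nics_to_match(CXL_LINK_GBYTES_PER_S, ASSUMED_NIC_GBITS_PER_S)
    print(f"bandwidth gap: {gap:.2f}x, NICs needed in the pool: {pool}")
```

Under these assumptions a single host NIC falls roughly an order of magnitude short of the interconnect link, which is the motivation for consolidating many NICs into a shared pool. The estimate ignores protocol overhead and the memory-bandwidth bottleneck that the abstract goes on to address.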