Message Passing Interface (MPI) is a foundational programming model for high-performance computing. MPI libraries traditionally rely on network interconnects (e.g., Ethernet and InfiniBand) and network protocols (e.g., TCP and RoCE) with complex software stacks for cross-node communication. We present cMPI, the first work to optimize MPI point-to-point communication (both one-sided and two-sided) using CXL memory sharing on a real CXL platform, transforming cross-node communication into memory transactions and data copies within CXL memory and bypassing traditional network protocols. We analyze performance across various interconnects and find that CXL memory sharing achieves 7.2x-8.1x lower latency than the TCP-based interconnects deployed in small- and medium-scale clusters. We address the challenges of using CXL memory sharing for MPI communication, including data object management over the dax representation [50], cache coherence, and atomic operations. Overall, for small messages, cMPI outperforms TCP over a standard Ethernet NIC and a high-end SmartNIC by up to 49x in latency and 72x in bandwidth.
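To make the mechanism described above concrete, the following is a minimal, hypothetical C sketch of turning a cross-node "send" into data copies within CXL memory: a CXL shared-memory region exposed to each node as a devdax device is mapped with mmap, the payload is copied into it, the relevant cache lines are written back so the peer node can observe the data, and a flag publishes the message. The device path /dev/dax0.0, the region layout, the flag protocol, and the use of clflushopt/sfence are illustrative assumptions, not cMPI's actual design.

```c
/* Hypothetical sketch: MPI-style "send" over a CXL shared-memory region.
 * Assumptions: the region is exposed as /dev/dax0.0, both nodes map it at
 * offset 0, and explicit cache-line writeback is needed for visibility.
 * This is an illustration of the general idea, not cMPI's implementation. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (2UL << 20)   /* illustrative 2 MiB mapping */
#define CACHELINE   64

/* Write back a buffer from the local cache so the remote node sees it.
 * clflushopt + sfence is one x86 option; availability is platform-dependent. */
static void flush_range(const void *p, size_t len) {
    const char *c = (const char *)p;
    for (size_t off = 0; off < len; off += CACHELINE)
        __asm__ __volatile__("clflushopt %0" :: "m"(c[off]) : "memory");
    __asm__ __volatile__("sfence" ::: "memory");
}

int main(void) {
    int fd = open("/dev/dax0.0", O_RDWR);   /* hypothetical devdax device */
    if (fd < 0) return 1;

    char *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) return 1;

    /* "Send": copy the payload into CXL memory and flush it, ... */
    const char msg[] = "hello over CXL";
    memcpy(region + CACHELINE, msg, sizeof msg);
    flush_range(region + CACHELINE, sizeof msg);

    /* ... then publish a ready flag with release semantics and flush it so
     * the receiving node's poll on this cache line observes the update. */
    uint64_t *flag = (uint64_t *)region;
    __atomic_store_n(flag, 1, __ATOMIC_RELEASE);
    flush_range(region, CACHELINE);

    munmap(region, REGION_SIZE);
    close(fd);
    return 0;
}
```

A receiver on the peer node would, under the same assumptions, map the same device, poll the flag cache line, and copy the payload out of the shared region; no NIC, TCP, or RDMA stack is involved in the data path.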