Message Passing Interface (MPI) is a foundational programming model for high-performance computing. MPI libraries traditionally employ network interconnects (e.g., Ethernet and InfiniBand) and network protocols (e.g., TCP and RoCE) with complex software stacks for cross-node communication. We present cMPI, the first work to optimize MPI point-to-point communication (both one-sided and two-sided) using CXL memory sharing on a real CXL platform, transforming cross-node communication into memory transactions and data copies within CXL memory and bypassing traditional network protocols. We analyze performance across various interconnects and find that CXL memory sharing achieves 7.2x-8.1x lower latency than the TCP-based interconnects deployed in small- and medium-scale clusters. We address the challenges of using CXL memory sharing for MPI communication, including data object management over the dax representation [50], cache coherence, and atomic operations. Overall, cMPI outperforms TCP over a standard Ethernet NIC and a high-end SmartNIC by up to 49x and 72x in latency and bandwidth, respectively, for small messages.
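As a rough illustration of the idea of turning cross-node communication into plain memory transactions, the C sketch below maps a CXL shared-memory region exposed as a devdax device and copies a small message into it, then publishes a ready flag for a remote reader to poll. This is not the authors' cMPI implementation; the device path `/dev/dax0.0`, the region size, the polling-flag layout, and the explicit cache-line flushes used for visibility are illustrative assumptions only.

```c
/* Minimal sketch, not the cMPI implementation: write a small message
 * into a CXL shared-memory window exposed via devdax, flush it, and
 * publish a ready flag.  Device path, sizes, and the flush-based
 * visibility protocol are assumptions for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <immintrin.h>            /* _mm_clflush, _mm_sfence (x86) */

#define REGION_SIZE (2UL << 20)   /* assumed 2 MiB shared window */
#define CACHELINE   64

/* Flush a buffer's cache lines so the data reaches CXL memory. */
static void flush_range(const void *addr, size_t len)
{
    const char *p = (const char *)addr;
    for (size_t off = 0; off < len; off += CACHELINE)
        _mm_clflush((void *)(p + off));
    _mm_sfence();                 /* order flushes before the flag write */
}

int main(void)
{
    /* Hypothetical devdax device backed by CXL shared memory. */
    int fd = open("/dev/dax0.0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    char *win = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (win == MAP_FAILED) { perror("mmap"); return 1; }

    /* "Send": copy the payload into the shared window, flush it,
     * then set a ready flag that a reader on another node can poll. */
    const char msg[] = "hello from rank 0";
    memcpy(win + CACHELINE, msg, sizeof msg);
    flush_range(win + CACHELINE, sizeof msg);
    __atomic_store_n((volatile int *)win, 1, __ATOMIC_RELEASE);
    flush_range(win, CACHELINE);

    munmap(win, REGION_SIZE);
    close(fd);
    return 0;
}
```

Whether explicit flushes are needed depends on the platform's coherence guarantees for shared CXL memory; the sketch takes the conservative, software-managed path.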