MPI implementations commonly rely on explicit memory-copy operations, which incur overhead from redundant data movement and buffer management. This overhead is especially costly for high-performance computing (HPC) workloads with intensive inter-processor communication. In response, we introduce MPI-over-CXL, a novel MPI communication paradigm built on Compute Express Link (CXL), which provides cache-coherent shared memory across multiple hosts. MPI-over-CXL replaces traditional data-copy methods with direct shared-memory access, significantly reducing communication latency and memory bandwidth usage. By mapping shared memory regions directly into the virtual address spaces of MPI processes, our design enables efficient pointer-based communication and eliminates redundant copy operations. To validate this approach, we implement a comprehensive hardware and software environment, including a custom CXL 3.2 controller, FPGA-based multi-host emulation, and a dedicated software stack. Our evaluations using representative benchmarks demonstrate substantial performance improvements over conventional MPI systems, underscoring MPI-over-CXL's potential to enhance efficiency and scalability in large-scale HPC environments.
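The pointer-based communication described in the abstract can be illustrated with a minimal single-host sketch. It is an analogy only: it uses Python's `multiprocessing.shared_memory` as a stand-in for a CXL-backed shared region, and the function names (`publish`, `attach_and_read`, `demo`), the 4096-byte region size, and the offset 128 are hypothetical choices made for this example, not part of the MPI-over-CXL design. The sender writes the payload once and passes only a small descriptor; the receiver maps the same region and reads the data in place.

```python
from multiprocessing import shared_memory


def publish(region, offset, payload):
    """Sender side: write the payload once into the shared region and
    return a small (offset, length) descriptor instead of the data."""
    region.buf[offset:offset + len(payload)] = payload
    return offset, len(payload)


def attach_and_read(region_name, descriptor):
    """Receiver side: map the same region by name and consume the payload
    in place through a memoryview, with no intermediate receive buffer."""
    offset, length = descriptor
    peer = shared_memory.SharedMemory(name=region_name)
    view = peer.buf[offset:offset + length]  # a view, not a copy
    try:
        # Stand-in for real in-place computation: length and byte checksum.
        return len(view), sum(view)
    finally:
        view.release()  # views must be released before the mapping closes
        peer.close()


def demo():
    # Hypothetical region size and offset, chosen only for illustration.
    region = shared_memory.SharedMemory(create=True, size=4096)
    try:
        descriptor = publish(region, 128, b"hello from rank 0")
        return attach_and_read(region.name, descriptor)
    finally:
        region.close()
        region.unlink()


if __name__ == "__main__":
    print(demo())
```

In this sketch the only thing exchanged between the two sides is the descriptor, which is the essence of replacing a send/receive copy with a shared mapping. A real multi-host design would also need synchronization and coherence guarantees that this example does not model.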