In the exascale computing era, optimizing MPI collective performance is critical for high-performance computing (HPC) applications. Current collective algorithms suffer performance degradation from system call overhead, page faults, and data-copy latency, which limits the efficiency and scalability of HPC applications. To address these issues, we propose PiP-MColl, a Process-in-Process-based Multi-object Inter-process MPI Collective design that maximizes small-message MPI collective performance at scale. PiP-MColl features efficient multi-sender and multi-receiver collective algorithms and leverages Process-in-Process shared-memory techniques to eliminate unnecessary system calls, page faults, and extra data copies, improving both intra- and inter-node message rate and throughput. The design also improves performance for larger messages, yielding gains across a wide range of message sizes. Experimental results show that PiP-MColl outperforms popular MPI libraries, including OpenMPI, MVAPICH2, and Intel MPI, by up to 4.6x for MPI collectives such as MPI_Scatter and MPI_Allgather.
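For readers less familiar with the two collectives named in the abstract, the following is a minimal sketch of their data-movement semantics only, simulated in plain Python with lists standing in for per-rank buffers. It is not PiP-MColl's algorithm and uses no MPI library; the function names `scatter` and `allgather` and the equal-chunk assumption are illustrative choices made here, not taken from the paper.

```python
from typing import List, Sequence


def scatter(send_buf: Sequence[int], num_ranks: int) -> List[List[int]]:
    """Simulate MPI_Scatter semantics: the root's buffer is split into
    num_ranks equal chunks and rank i receives chunk i.

    Assumption (illustrative): the buffer length divides evenly.
    """
    if num_ranks <= 0:
        raise ValueError("num_ranks must be positive")
    if len(send_buf) % num_ranks != 0:
        raise ValueError("buffer length must be a multiple of num_ranks")
    chunk = len(send_buf) // num_ranks
    return [list(send_buf[i * chunk:(i + 1) * chunk]) for i in range(num_ranks)]


def allgather(per_rank_bufs: Sequence[Sequence[int]]) -> List[List[int]]:
    """Simulate MPI_Allgather semantics: every rank contributes its buffer
    and every rank ends up with the concatenation in rank order."""
    gathered = [x for buf in per_rank_bufs for x in buf]
    # Each rank gets its own copy of the full gathered buffer.
    return [list(gathered) for _ in per_rank_bufs]


if __name__ == "__main__":
    chunks = scatter(list(range(8)), num_ranks=4)
    print("after scatter:  ", chunks)
    print("after allgather:", allgather(chunks))
```

In a real MPI library each of these list copies corresponds to message transfers between processes, which is where the system call, page fault, and data-copy costs discussed in the abstract arise.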