Collective communication operations such as MPI_Alltoallv are central to many HPC applications, particularly those with irregular message sizes. We design, implement, and evaluate persistent MPI RMA variants of Alltoallv based on fence and lock synchronization. These variants separate a one-time initialization phase from per-iteration execution so that communication metadata and window state can be reused across repeated epochs. Our benchmarks on LLNL's Dane supercomputer show that the fence-persistent variant consistently outperforms the non-persistent baseline for large message sizes, achieving up to a 44% reduction in runtime and improving scalability with increasing process counts; at 448 processes the runtime decreases from 2.49 s to 1.54 s (a 38% reduction). We further evaluate the algorithms under irregular sparse communication patterns and compare fence- and lock-based designs, including hierarchical extensions. Message-size sweeps and a break-even model demonstrate that persistence provides immediate payoff for messages of at least 32,768 bytes, while smaller messages show limited benefit due to metadata amortization costs. These results indicate that persistent RMA Alltoallv is a practical approach for workloads with large messages: removing repeated metadata processing leaves runtime dominated by data movement, as evidenced by time savings that increase with message size. The results also clarify the trade-offs between fence and lock synchronization on modern HPC systems.
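The break-even reasoning can be illustrated with a small sketch. This is not the model used in the paper, whose exact form is not given here; it assumes the simplest possible cost model, in which a persistent variant pays a one-time initialization cost and then a fixed per-iteration cost, while the baseline pays only a (larger) fixed per-iteration cost. The function names and the sample initialization cost are hypothetical; only the 2.49 s and 1.54 s runtimes come from the text above.

```python
import math


def relative_reduction(baseline_s, persistent_s):
    """Fraction of the baseline runtime removed by the persistent variant."""
    return (baseline_s - persistent_s) / baseline_s


def break_even_iterations(init_cost_s, baseline_iter_s, persistent_iter_s):
    """Smallest iteration count n with init + n * persistent <= n * baseline.

    Assumed cost model (hypothetical, not taken from the paper):
    one-time setup plus constant per-iteration costs.
    Returns None when the persistent variant is not faster per iteration,
    so the setup cost is never recovered.
    """
    saving = baseline_iter_s - persistent_iter_s
    if saving <= 0:
        return None
    # At least one iteration is needed before any saving is realized.
    return max(1, math.ceil(init_cost_s / saving))


# Runtimes reported at 448 processes: 2.49 s -> 1.54 s.
print(f"reduction: {relative_reduction(2.49, 1.54):.1%}")

# Hypothetical setup cost of 1.0 s with per-iteration costs of 0.5 s and 0.25 s.
print("break-even iterations:", break_even_iterations(1.0, 0.5, 0.25))
```

Under this assumed model, "immediate payoff" corresponds to a break-even count of one iteration, which occurs when the per-iteration saving is at least as large as the initialization cost.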