Analyzing Persistent Alltoallv RMA Implementations for High-Performance MPI Communication

Collective communication operations such as MPI_Alltoallv are central to many HPC applications, particularly those with irregular message sizes. We design, implement, and evaluate persistent MPI RMA variants of Alltoallv based on fence and lock synchronization, separating a one-time initialization phase from per-iteration execution so that communication metadata and window state can be reused across repeated epochs. Benchmarks on LLNL's Dane supercomputer show that the fence-persistent variant consistently outperforms the non-persistent baseline for large message sizes, achieving up to a 44% reduction in runtime and improving scalability with increasing process counts; at 448 processes the runtime decreases from 2.49s to 1.54s (a 38% reduction). We further evaluate the algorithms under irregular sparse communication patterns and compare fence- and lock-based designs, including hierarchical extensions. Message-size sweeps and a break-even model demonstrate that persistence provides immediate payoff for messages of 32,768 bytes or more, while smaller messages show limited benefit due to metadata amortization costs. These results indicate that persistent RMA Alltoallv is a practical approach for workloads with large messages: removing repeated metadata processing leaves runtime dominated by data movement, as evidenced by time savings that grow with message size. They also clarify the trade-offs between fence and lock synchronization on modern HPC systems.
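The break-even reasoning can be illustrated with a minimal sketch. The abstract does not give the model's form, so the code below assumes a simple linear amortization: a one-time initialization cost is paid once and then recovered through the per-iteration saving. The function names and the `t_init` parameter are hypothetical; only the 2.49s and 1.54s runtimes come from the text above.

```python
import math


def relative_reduction(t_before, t_after):
    """Fraction of runtime removed when going from t_before to t_after."""
    return (t_before - t_after) / t_before


def break_even_iterations(t_init, t_base, t_persist):
    """Smallest iteration count n with t_init + n * t_persist <= n * t_base.

    Assumed linear model (not taken from the paper): t_init is the one-time
    setup cost of the persistent variant, t_base and t_persist are the
    per-iteration runtimes of the baseline and the persistent variant.
    Returns None when the persistent variant is not faster per iteration,
    in which case the setup cost is never recovered.
    """
    saving = t_base - t_persist
    if saving <= 0:
        return None
    return max(1, math.ceil(t_init / saving))


if __name__ == "__main__":
    # Runtimes reported at 448 processes.
    print(f"{relative_reduction(2.49, 1.54):.0%}")  # prints 38%
    # Hypothetical setup cost of 0.5s, chosen only for illustration.
    print(break_even_iterations(0.5, 2.49, 1.54))
```

Under this assumed model, "immediate payoff" corresponds to a break-even count of 1: the per-iteration saving already exceeds the setup cost. When the per-iteration saving is small relative to the setup cost, as the abstract suggests for small messages, many iterations are needed before persistence pays off.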

