Irregular communication often limits both the performance and scalability of parallel applications. Typically, each application implements its irregular communication with point-to-point messages, and any optimizations are written directly into the application. As a result, these optimizations are not portable. Point-to-point messages are also difficult to optimize within MPI itself, because the interface for a single message carries no information about the full set of communication to be performed. The persistent neighborhood collective API, released in the MPI 4 standard, addresses this by providing an interface through which MPI libraries can apply portable optimizations to irregular communication. This paper presents methods for optimizing irregular communication within neighborhood collectives, analyzes the impact of replacing point-to-point communication with neighborhood collectives in existing codebases such as Hypre BoomerAMG, and demonstrates a speedup of up to 1.32x for sparse matrix-vector multiplication within a BoomerAMG solve when using our optimized neighborhood collectives. We analyze multiple implementations of neighborhood collectives, including a standard implementation, which simply wraps point-to-point communication, and several implementations of locality-aware aggregation. All optimizations are available in an open-source codebase, MPI Advance, which sits on top of MPI, so the optimizations can be added to existing codebases regardless of the system MPI installation.
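As a rough illustration of why locality-aware aggregation can help, the sketch below counts inter-node messages for a small irregular pattern, first when every pair of ranks communicates directly and then when data is combined per pair of nodes. This is a toy counting model under stated assumptions, not the algorithm implemented in MPI Advance: the function names, the block mapping of ranks to nodes, and the example pattern are all hypothetical, and the extra on-node messages needed to gather and redistribute the aggregated data are not counted.

```python
from collections import defaultdict


def node_of(rank, ranks_per_node):
    """Assumed block mapping: ranks 0..ranks_per_node-1 live on node 0, and so on."""
    return rank // ranks_per_node


def count_standard(messages, ranks_per_node):
    """Number of inter-node messages when each (src, dst) pair is sent directly."""
    return sum(
        1
        for src, dst, _ in messages
        if node_of(src, ranks_per_node) != node_of(dst, ranks_per_node)
    )


def count_aggregated(messages, ranks_per_node):
    """Inter-node messages when data is combined per (src node, dst node) pair.

    Returns the number of inter-node messages and the bytes carried by each.
    """
    bytes_per_node_pair = defaultdict(int)
    for src, dst, nbytes in messages:
        src_node = node_of(src, ranks_per_node)
        dst_node = node_of(dst, ranks_per_node)
        if src_node != dst_node:
            bytes_per_node_pair[(src_node, dst_node)] += nbytes
    return len(bytes_per_node_pair), dict(bytes_per_node_pair)


# Hypothetical irregular pattern: (source rank, destination rank, bytes).
RANKS_PER_NODE = 4
MESSAGES = [
    (0, 4, 100),
    (0, 5, 100),
    (1, 4, 50),
    (2, 6, 25),
    (3, 1, 10),  # on-node: never crosses the network
    (5, 0, 40),
]

standard_count = count_standard(MESSAGES, RANKS_PER_NODE)
aggregated_count, aggregated_bytes = count_aggregated(MESSAGES, RANKS_PER_NODE)
```

In this model, aggregation trades many small inter-node messages for fewer, larger ones. Whether that pays off in practice depends on the relative cost of on-node and off-node communication, which this sketch does not model.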