Partitioned communication was introduced in MPI 4.0 as a user-friendly interface to support pipelined communication patterns, which are particularly common in MPI+threads applications. It lets the user divide a global buffer into smaller chunks, called partitions, which can then be communicated independently. In this work we first model the performance gain that can be expected when using partitioned communication. Next, we describe the improvements we made to \mpich{} to enable those gains and provide a high-quality implementation of MPI partitioned communication. We then evaluate partitioned communication in various common use cases and compare its performance with other MPI point-to-point and one-sided approaches. Specifically, we first investigate two scenarios commonly encountered for small partition sizes in a multithreaded environment: thread contention and the overhead of using many partitions. We propose two solutions to alleviate the measured penalty and demonstrate their use. We then focus on large messages and the gain obtained when exploiting the delay resulting from computations or load imbalance. We conclude with our perspectives on the benefits of partitioned communication and the various results obtained.