We give a faster, communication-free, parallel construction of optimal communication schedules for broadcasting $n$ distinct blocks of data from a root processor to all other processors in $1$-ported, $p$-processor networks with fully bidirectional communication. For any $p$ and $n$, broadcasting in this model requires $n-1+\lceil\log_2 p\rceil$ communication rounds. In contrast to other constructions, all processors follow the same circulant graph communication pattern, which makes it possible to use the schedules for the allgather (all-to-all-broadcast) operation as well. The new construction takes $O(\log^3 p)$ time steps and $O(\log p)$ space per processor, and each processor computes its part of the schedule independently of the other processors. This is a significant improvement, of considerable practical import, over the sequential $O(p \log^2 p)$ time and $O(p\log p)$ space construction of Tr\"aff and Ripke (2009). The round-optimal schedule construction is then used to implement communication-optimal algorithms for the broadcast and (irregular) allgather collective operations as found in MPI (the \emph{Message-Passing Interface}); for certain problem ranges, these implementations improve significantly in practice over those in standard MPI libraries (\texttt{mpich}, OpenMPI, Intel MPI). The application to the irregular allgather operation is entirely new.
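As a small illustration of the quantities named above, the following sketch evaluates the round count $n-1+\lceil\log_2 p\rceil$ and shows what "the same circulant graph pattern for all processors" means for one round. The function names `optimal_rounds` and `circulant_peers` and the skip value passed in are hypothetical; the actual skips and block schedules of the construction are not stated here and are not reproduced by this sketch.

```python
def optimal_rounds(p: int, n: int) -> int:
    """Round count n - 1 + ceil(log2 p) for broadcasting n blocks to p processors."""
    if p < 1 or n < 1:
        raise ValueError("p and n must be positive")
    # (p - 1).bit_length() equals ceil(log2 p) for every integer p >= 1,
    # which avoids floating-point rounding in math.log2.
    return n - 1 + (p - 1).bit_length()


def circulant_peers(rank: int, p: int, skip: int) -> tuple[int, int]:
    """Peers of `rank` in a circulant graph on p processors for a given skip.

    Every processor applies the same rule: send to (rank + skip) mod p and
    receive from (rank - skip) mod p. The skip is an assumed input here.
    """
    return (rank + skip) % p, (rank - skip) % p


if __name__ == "__main__":
    print(optimal_rounds(1000, 1))
    print(optimal_rounds(8, 4))
    print(circulant_peers(0, 8, 1))
```

Because both peers are obtained from the rank by the same modular rule, each processor can determine its communication partners locally, without exchanging any information with other processors.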