We give optimally fast $O(\log p)$ time (per processor) algorithms for computing round-optimal broadcast schedules for message-passing parallel computing systems. This affirmatively answers the questions posed in Tr\"aff (2022). The problem is to broadcast $n$ indivisible blocks of data from a given root processor to all other processors in a (subgraph of a) fully connected network of $p$ processors with fully bidirectional, one-ported communication capabilities. In this model, $n-1+\lceil\log_2 p\rceil$ communication rounds are required. Our new algorithms compute, for each processor in the network, a receive schedule and a send schedule, each of size $\lceil\log_2 p\rceil$. For each communication round, these schedules uniquely determine in $O(1)$ time the new block that the processor will receive and the already received block that it has to send. Schedule computations are done independently per processor without communication. The broadcast communication subgraph is the same easily computable, directed, $\lceil\log_2 p\rceil$-regular circulant graph used in Tr\"aff (2022) and elsewhere. We show how the schedule computations can be done in optimal time and space of $O(\log p)$, improving significantly over previous results of $O(p\log^2 p)$ and $O(\log^3 p)$. The schedule computation and broadcast algorithms are simple to implement, but their correctness and complexity are not obvious. All algorithms have been implemented, compared to previous algorithms, and briefly evaluated on a small $36\times 32$ processor-core cluster.
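
To make the quantities above concrete, the following minimal Python sketch computes the round count $n-1+\lceil\log_2 p\rceil$ and the out-neighbors of a processor in a $\lceil\log_2 p\rceil$-regular circulant graph. The abstract does not state how the circulant skips are chosen, so the rule used here (halving $p$ repeatedly, rounding up) is an assumption made for illustration only. All function names are hypothetical, and the sketch does not attempt to reproduce the paper's schedule computation.

```python
def ceil_log2(p: int) -> int:
    """Return ceil(log2(p)) for p >= 1 using exact integer arithmetic."""
    if p < 1:
        raise ValueError("p must be a positive integer")
    return (p - 1).bit_length()


def broadcast_rounds(n: int, p: int) -> int:
    """Round count n - 1 + ceil(log2 p) stated in the abstract."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n - 1 + ceil_log2(p)


def circulant_skips(p: int) -> list[int]:
    """Skips of a ceil(log2 p)-regular circulant graph on p processors.

    ASSUMPTION: the skips are obtained by halving p repeatedly, rounding up.
    The abstract only says the graph is "easily computable"; this rule is
    an illustrative choice, not taken from the text.
    """
    skips = []
    s = p
    for _ in range(ceil_log2(p)):
        s = (s + 1) // 2  # ceil(s / 2)
        skips.append(s)
    return sorted(skips)


def out_neighbors(r: int, p: int) -> list[int]:
    """Processors that r sends to: (r + skip) mod p for every skip."""
    return [(r + s) % p for s in circulant_skips(p)]


if __name__ == "__main__":
    p = 36 * 32  # size of the cluster mentioned in the abstract
    print(ceil_log2(p), broadcast_rounds(10, p))
    print(out_neighbors(0, 9))
```

For the $36\times 32$ cluster, $p = 1152$ gives $\lceil\log_2 p\rceil = 11$, so broadcasting $n = 10$ blocks takes $20$ rounds, and under the assumed skip rule every processor has exactly $11$ distinct out-neighbors.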