Parallel scan primitives compute element-wise inclusive or exclusive prefix sums of input vectors contributed by $p$ consecutively ranked processors under an associative, possibly expensive, binary operator $\oplus$. In message-passing systems with bounded, one-ported communication capabilities, at least $\lceil\log_2 p\rceil$ send-receive communication rounds are required for the inclusive scan and at least $\lceil\log_2 (p-1)\rceil$ for the exclusive scan. While there are well-known, simple algorithms that solve the inclusive scan in $\lceil\log_2 p\rceil$ send-receive communication rounds with $\lceil\log_2 p\rceil$ applications of the $\oplus$ operator, the exclusive scan is different and has received much less attention. By considering natural invariants for the exclusive prefix sums problem, we present two different algorithms that are efficient in both the number of communication rounds and the number of applications of the $\oplus$ operator. The first algorithm consists of an inclusive scan phase and an exclusive scan phase, and trades the number of communication rounds against the number of applications of the $\oplus$ operator. With $q=\lceil\log_2 p\rceil$ rounds in total, the smallest possible number $q'$ of inclusive scan rounds satisfies $q'\geq q-\log_2(2^q-p+1)$. The second algorithm is a modification of a round-optimal all-reduce algorithm, in which the number of additional applications of the $\oplus$ operator depends on the number of set bits (the popcount) of $p-1$. Both algorithms are relevant for small(er) input vectors, where performance is dominated by the number of communication rounds. For large input vectors, other (pipelined, fixed-degree tree) algorithms must be used.
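To make the quantities in the abstract concrete, the following is a minimal sequential simulation, not the algorithms proposed here. It sketches the well-known doubling inclusive scan (one send and one receive per processor per round), obtains an exclusive scan from it by the naive one-position shift (which would cost an extra communication round in a real system), and evaluates the bound on $q'$ and the popcount of $p-1$. All function names are hypothetical, and treating the bound as the smallest integer satisfying the stated inequality is an assumption.

```python
from operator import add


def inclusive_scan_doubling(values, op=add):
    """Simulate the classic doubling inclusive scan over p 'processors'.

    values[r] is the input of processor r. Returns a tuple
    (results, rounds, max_ops): the inclusive prefix sums, the number of
    communication rounds, and the largest number of op applications
    performed by any single processor.
    """
    p = len(values)
    partial = list(values)
    ops = [0] * p
    rounds = 0
    dist = 1
    while dist < p:
        prev = list(partial)  # values as they were before this round
        for r in range(dist, p):
            # Processor r receives from processor r - dist. The received
            # value goes on the left: op is associative but need not be
            # commutative.
            partial[r] = op(prev[r - dist], prev[r])
            ops[r] += 1
        dist *= 2
        rounds += 1
    return partial, rounds, (max(ops) if p else 0)


def exclusive_by_shift(inclusive):
    """Naive exclusive scan: processor r takes the inclusive result of r-1.

    Processor 0 has no defined result, represented here by None.
    """
    return [None] + list(inclusive[:-1])


def ceil_log2(p):
    """Return ceil(log2(p)) for p >= 1 using integer arithmetic."""
    return (p - 1).bit_length()


def min_inclusive_rounds_bound(p):
    """Smallest integer q' with q' >= q - log2(2**q - p + 1), q = ceil(log2 p).

    Assumption: this only evaluates the inequality from the abstract; it
    does not show that the bound is attained.
    """
    q = ceil_log2(p)
    m = 2**q - p + 1          # m >= 1 because p <= 2**q
    return q - (m.bit_length() - 1)  # q - floor(log2(m))


def popcount(n):
    """Number of set bits in the non-negative integer n."""
    return bin(n).count("1")


if __name__ == "__main__":
    # String concatenation is associative but not commutative.
    inc, rounds, max_ops = inclusive_scan_doubling(list("abcde"))
    print(inc, rounds, max_ops)
    print(exclusive_by_shift(inc))
    for p in (5, 6, 7, 8):
        print(p, ceil_log2(p), min_inclusive_rounds_bound(p), popcount(p - 1))
```

For $p=5$ the simulation uses $\lceil\log_2 5\rceil = 3$ rounds and at most 3 applications of the operator on any processor, while the bound gives $q'\geq 1$; for $p=8$ it gives $q'\geq 3$, so no inclusive round can be saved when $p$ is a power of two.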