We present a parallel algorithm for the fast Fourier transform (FFT) in higher dimensions. The algorithm generalizes the cyclic-to-cyclic one-dimensional parallel algorithm to a cyclic-to-cyclic multidimensional parallel algorithm while retaining the property that only a single all-to-all communication step is needed. This property holds provided that we use at most $\sqrt{N}$ processors for an FFT on an array with a total of $N$ elements, irrespective of the dimension $d$ or the shape of the array. The only assumption we make is that $N$ is sufficiently composite. Our algorithm starts and ends in the same data distribution. We present our multidimensional implementation FFTU, which uses the sequential FFTW program for its local FFTs and can handle any dimension $d$. We obtain experimental results for $d\leq 5$ using MPI on up to 4096 cores of the supercomputer Snellius, comparing FFTU with the parallel FFTW program and with PFFT and heFFTe. These results show that FFTU is competitive with the state of the art and that it allows one to use a larger number of processors, while keeping communication limited to a single all-to-all operation. For arrays of size $1024^3$ and $64^5$, FFTU achieves speedups by factors of 149 and 176, respectively, on 4096 processors.
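To make the single all-to-all property concrete, the following is a minimal NumPy sketch of the one-dimensional cyclic-to-cyclic scheme that the abstract describes as the starting point of the generalization. It is not FFTU code: the processors are simulated by a Python list, the all-to-all is a list comprehension rather than an MPI call, `np.fft.fft` stands in for FFTW, and the function name `cyclic_to_cyclic_fft` is hypothetical. The sketch assumes that $p^2$ divides $N$, which is one simple way to satisfy the condition $p \leq \sqrt{N}$.

```python
import numpy as np

def cyclic_to_cyclic_fft(x, p):
    """Simulate a 1D FFT on p 'processors' with cyclic input and output.

    Illustrative sketch only: assumes p*p divides len(x).
    Returns a list whose entry t is the part of the FFT held by processor t.
    """
    n = len(x)
    assert n % (p * p) == 0, "sketch assumes p^2 divides N"
    m = n // p  # local vector length

    # Cyclic input distribution: processor s holds x[s], x[s+p], x[s+2p], ...
    local = [np.asarray(x[s::p], dtype=complex) for s in range(p)]

    # Step 1 (no communication): local FFT of length N/p, then twiddle
    # factors exp(-2*pi*i*s*k1/N) for local frequency index k1.
    k1 = np.arange(m)
    for s in range(p):
        local[s] = np.fft.fft(local[s]) * np.exp(-2j * np.pi * s * k1 / n)

    # Step 2 (the single all-to-all): processor t receives from every
    # processor s the entries whose index k1 is congruent to t modulo p.
    recv = [np.stack([local[s][t::p] for s in range(p)]) for t in range(p)]

    # Step 3 (no communication): FFTs of length p across the sender axis.
    # Entry [k2, i] on processor t is X[(t + p*i) + (N/p)*k2].
    out = [np.fft.fft(recv[t], axis=0) for t in range(p)]

    # Because p divides N/p, all of these global indices are congruent to t
    # modulo p, so a row-major flatten yields exactly X[t::p] in order.
    return [out[t].reshape(-1) for t in range(p)]


# Usage: compare each processor's part with a sequential FFT.
rng = np.random.default_rng(0)
x = rng.standard_normal(64) + 1j * rng.standard_normal(64)
parts = cyclic_to_cyclic_fft(x, 4)
reference = np.fft.fft(x)
```

With $N = 64$ and $p = 4$, each simulated processor `t` ends up holding `reference[t::4]`, the same cyclic layout in which the input started, after exactly one exchange of data.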