Modern processors deliver higher throughput for lower-precision arithmetic than for higher-precision arithmetic. For matrix multiplication, the Ozaki scheme exploits this performance gap by splitting the inputs into lower-precision components and delegating the computation to optimized lower-precision routines. However, no similar approach exists for the fast Fourier transform (FFT). Here, we propose a method that computes target-precision FFTs using lower-precision FFTs by applying the Ozaki scheme to the cyclic convolution in the Bluestein FFT. The convolutions of the split components are computed exactly by using the number theoretic transform (NTT), an FFT over a finite field, in place of floating-point FFTs and combining the results with the Chinese remainder theorem. We introduce an upper bound on the number of splits and an NTT-domain accumulation strategy that reduces the number of NTT calls. As a concrete instance, we implement a double-precision FFT using 32-bit NTTs and confirm that its relative error is lower than that of FFTs based on FFTW and on Triple-Single precision arithmetic, and that the error remains stable across FFT lengths. The implementation requires at most 96 NTT calls, or at most 64 with NTT-domain accumulation. On an Intel Xeon Platinum 8468, for lengths $n=2^{10}$ to $2^{18}$, the execution time is approximately 107 to 1315$\times$ that of FFTW's double-precision FFT, with NTTs accounting for approximately 80% of the total time.
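The exact-convolution step can be illustrated with a minimal sketch; it is not the authors' implementation. It computes a cyclic convolution of signed integers modulo two primes with an NTT and then recombines the residues with the Chinese remainder theorem. The moduli 998244353 and 469762049 (both below $2^{32}$, both with primitive root 3) are an assumption chosen for illustration, since the abstract does not state which moduli are used, and the function names are hypothetical. The Ozaki splitting of floating-point inputs and the Bluestein chirp factors are omitted.

```python
# Assumed moduli: two NTT-friendly primes below 2**32 (not taken from the source).
P1 = 998244353   # 119 * 2**23 + 1
P2 = 469762049   # 7 * 2**26 + 1
G = 3            # primitive root of both primes


def ntt(values, p, invert=False):
    """Radix-2 NTT modulo prime p; len(values) must be a power of two
    that divides p - 1 (at most 2**23 for the primes above)."""
    n = len(values)
    a = [v % p for v in values]
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes.
    length = 2
    while length <= n:
        w_len = pow(G, (p - 1) // length, p)
        if invert:
            w_len = pow(w_len, p - 2, p)  # modular inverse via Fermat
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % p
                a[k] = (u + v) % p
                a[k + half] = (u - v) % p
                w = w * w_len % p
        length <<= 1
    if invert:
        n_inv = pow(n, p - 2, p)
        a = [v * n_inv % p for v in a]
    return a


def cyclic_convolution_mod(x, y, p):
    """Cyclic convolution of x and y with all arithmetic modulo p."""
    fx, fy = ntt(x, p), ntt(y, p)
    return ntt([u * v % p for u, v in zip(fx, fy)], p, invert=True)


def exact_cyclic_convolution(x, y):
    """Exact signed-integer cyclic convolution via two NTTs and the CRT.
    Valid while every true output has magnitude below P1 * P2 / 2."""
    r1 = cyclic_convolution_mod(x, y, P1)
    r2 = cyclic_convolution_mod(x, y, P2)
    modulus = P1 * P2
    p1_inv = pow(P1, P2 - 2, P2)  # inverse of P1 modulo P2
    result = []
    for a, b in zip(r1, r2):
        v = a + P1 * ((b - a) * p1_inv % P2)  # CRT reconstruction in [0, modulus)
        result.append(v - modulus if v > modulus // 2 else v)  # signed lift
    return result
```

In this sketch the result is exact only while each true convolution output stays below half the product of the moduli in magnitude; adding further primes widens that range at the cost of more NTT calls, which is the trade-off that a bound on the number of splits and an accumulation strategy aim to control.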