The Discrete Fourier Transform (DFT) is essential for applications ranging from signal processing to convolution and polynomial multiplication. The groundbreaking Fast Fourier Transform (FFT) algorithm reduces the time complexity of the DFT from the naive O(n^2) to O(n log n), and recent works have sought further acceleration through parallel architectures such as GPUs. Unfortunately, accelerators such as GPUs cannot exploit their full computing capabilities because memory access becomes the bottleneck. Therefore, this paper accelerates the FFT algorithm using digital Processing-in-Memory (PIM) architectures, which shift computation into the memory by exploiting physical devices capable of both storage and logic (e.g., memristors). We propose an O(log n) in-memory FFT algorithm that can also be performed in parallel across multiple arrays for high-throughput batched execution, supporting both fixed-point and floating-point numbers. Through the convolution theorem, we extend this algorithm to O(log n) polynomial multiplication, a fundamental task for applications such as cryptography. We evaluate FourierPIM on a publicly available cycle-accurate simulator that verifies both correctness and performance, and demonstrate 5-15x throughput and 4-13x energy improvement over the NVIDIA cuFFT library on state-of-the-art GPUs for FFT and polynomial multiplication.
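To make the convolution-theorem step concrete, the following is a minimal sequential sketch of polynomial multiplication via the FFT: transform both coefficient vectors, multiply pointwise, and apply the inverse transform. It is a plain CPU illustration that runs in O(n log n) time, not the in-memory O(log n) algorithm proposed here, and the names `fft` and `poly_multiply` are hypothetical helpers introduced only for this example. It assumes integer coefficients small enough that rounding the floating-point result is exact.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT (illustrative helper).

    len(values) must be a power of two. With invert=True the conjugate
    twiddle factors are used; the caller divides by the length.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size transforms.
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def poly_multiply(a, b):
    """Multiply two integer polynomials given lowest-degree coefficient first."""
    result_len = len(a) + len(b) - 1
    size = 1
    while size < result_len:  # pad to a power of two to avoid wrap-around
        size *= 2
    fa = fft(list(a) + [0] * (size - len(a)))
    fb = fft(list(b) + [0] * (size - len(b)))
    # Convolution theorem: pointwise product in the frequency domain.
    product = fft([x * y for x, y in zip(fa, fb)], invert=True)
    return [round((c / size).real) for c in product[:result_len]]


# (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3
print(poly_multiply([1, 2, 3], [4, 5]))
```

Zero-padding to at least len(a) + len(b) - 1 points is what turns the cyclic convolution computed by the FFT into the linear convolution that polynomial multiplication requires.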