The Fast Fourier Transform (FFT) is a fundamental numerical technique used across a wide range of scientific problems. As scientific simulations move to exploit exascale systems, demand has grown for distributed FFT algorithms that make effective use of modern heterogeneous high-performance computing (HPC) systems. Conventional FFT algorithms commonly encounter performance bottlenecks, especially on heterogeneous platforms. Most distributed FFT approaches rely on static task distribution and require synchronization barriers, which limits scalability and reduces overall resource utilization. In this paper, we present DaggerFFT, a distributed FFT framework developed in Julia that treats highly parallel FFT computations as a dynamically scheduled task graph. FFT operations are expressed as DTasks operating on pencil- or slab-partitioned DArrays. Each FFT stage owns its own separately defined DArray, and the runtime assigns DTasks across devices using Dagger's dynamic, work-stealing scheduler. We show that DaggerFFT's dynamic scheduler can outperform state-of-the-art distributed FFT libraries on both CPU and GPU backends, achieving up to a 2.6x speedup on CPU clusters and up to a 1.35x speedup on GPU clusters. We have also integrated DaggerFFT into Oceananigans.jl, a geophysical fluid dynamics framework, demonstrating that high-level, task-based runtimes can deliver both superior performance and modularity in large-scale, real-world simulations.
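To make the stage-wise, task-based structure concrete, the following is a minimal sketch of a slab-decomposed 2D transform in which each stage is split into independent tasks. It is an illustration only and does not use Dagger.jl or reproduce DaggerFFT's implementation: it is written in standard-library Python, a thread pool stands in for the dynamic scheduler (work stealing and multi-device placement are not modeled), a naive DFT stands in for an optimized FFT kernel, and all function names (`dft`, `fft_stage`, `slab_fft2`, `direct_dft2`) are hypothetical.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def dft(row):
    """Naive O(n^2) 1D DFT; a stand-in for an optimized FFT kernel."""
    n = len(row)
    return [
        sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_stage(pool, rows, n_slabs):
    """One stage: split the rows into slabs and transform each slab as a task."""
    size = -(-len(rows) // n_slabs)  # ceiling division
    slabs = [rows[i:i + size] for i in range(0, len(rows), size)]
    # The pool hands each task to whichever worker is idle (assumption:
    # this only approximates a dynamic scheduler; no work stealing here).
    futures = [pool.submit(lambda s: [dft(r) for r in s], slab) for slab in slabs]
    # Gather results in slab order, regardless of completion order.
    return [row for f in futures for row in f.result()]


def transpose(rows):
    """Stand-in for the data redistribution between stages."""
    return [list(col) for col in zip(*rows)]


def slab_fft2(grid, n_slabs=2, workers=2):
    """2D DFT as two stages of 1D transforms, each stage on its own array."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        stage1 = fft_stage(pool, grid, n_slabs)               # along rows
        stage2 = fft_stage(pool, transpose(stage1), n_slabs)  # along columns
    return transpose(stage2)


def direct_dft2(grid):
    """Reference 2D DFT computed straight from the definition."""
    m, n = len(grid), len(grid[0])
    return [
        [
            sum(
                grid[a][b] * cmath.exp(-2j * cmath.pi * (a * k / m + b * l / n))
                for a in range(m)
                for b in range(n)
            )
            for l in range(n)
        ]
        for k in range(m)
    ]
```

Calling `slab_fft2(grid)` on a small rectangular list of lists should agree with `direct_dft2(grid)` up to floating-point error. Each stage produces a new array rather than overwriting its input, which loosely mirrors the idea of every stage owning its own distributed array.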