The Fast Fourier Transform (FFT), a core computation in a wide range of scientific applications, is increasingly threatened by reliability issues. In this paper, we introduce TurboFFT, a high-performance FFT implementation equipped with a two-sided checksum scheme that efficiently detects and corrects silent data corruptions in compute units. The proposed two-sided checksum addresses the error propagation issue by encoding a batch of input signals with different linear combinations, which not only allows fast batched error detection but also enables on-the-fly error correction without recomputation. We explore two-sided checksum designs at the kernel, thread, and threadblock levels, and provide a baseline FFT implementation competitive with the state-of-the-art, closed-source cuFFT. We demonstrate a kernel fusion strategy that mitigates the computation and memory overhead introduced by fault tolerance and overlaps it with the underlying FFT computation. We present a template-based code generation strategy to reduce development costs and support a wide range of input sizes and data types. Experimental results on an NVIDIA A100 server GPU and a Tesla Turing T4 GPU demonstrate that TurboFFT offers competitive or superior performance compared to the closed-source library cuFFT. TurboFFT incurs only minimal overhead (7\% to 15\% on average) compared to cuFFT, even under hundreds of error injections per minute, for both single and double precision. TurboFFT achieves a 23\% improvement over existing fault-tolerant FFT schemes.
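To make the idea of encoding a batch with different linear combinations concrete, the following is a minimal sketch and not TurboFFT's actual kernels. It assumes a fault that is confined to the output of a single signal in the batch, uses a naive DFT in place of an optimized FFT, and picks two hypothetical weight vectors, (1, 1, ..., 1) and (1, 2, ..., B). Because the DFT is linear, transforming a weighted sum of the inputs must match the same weighted sum of the outputs; the first mismatch gives the error itself, and the ratio of the two mismatches gives the index of the corrupted signal. All function names and the absolute tolerance are illustrative assumptions.

```python
import cmath


def dft(x):
    """Naive O(N^2) DFT, standing in for an optimized FFT kernel."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def combine(rows, weights):
    """Weighted sum of equal-length rows (one linear combination of the batch)."""
    return [
        sum(w * row[k] for w, row in zip(weights, rows))
        for k in range(len(rows[0]))
    ]


def protected_batch_dft(batch, fault=None, tol=1e-8):
    """Transform a batch, then detect and repair an error in one signal.

    fault: optional (signal_index, bin_index, delta) used to simulate a
           silent data corruption in the computed output.
    tol:   hypothetical absolute threshold separating round-off from faults.
    Returns (outputs, index of the corrected signal or None).
    """
    b = len(batch)
    w1 = [1.0] * b                          # plain sum of the batch
    w2 = [float(i + 1) for i in range(b)]   # index-weighted sum of the batch

    # Encode the inputs and transform the two checksum signals.
    ref1 = dft(combine(batch, w1))
    ref2 = dft(combine(batch, w2))

    outputs = [dft(x) for x in batch]
    if fault is not None:
        i, k, delta = fault
        outputs[i][k] += delta              # simulated corruption

    # By linearity, these differences are zero (up to round-off) if no fault.
    d1 = [r - s for r, s in zip(ref1, combine(outputs, w1))]
    d2 = [r - s for r, s in zip(ref2, combine(outputs, w2))]

    k_max = max(range(len(d1)), key=lambda k: abs(d1[k]))
    if abs(d1[k_max]) <= tol:
        return outputs, None

    # d2 = (index + 1) * d1 when a single signal is corrupted.
    bad = round((d2[k_max] / d1[k_max]).real) - 1
    # d1 equals the negated error, so adding it restores the output.
    outputs[bad] = [y + d for y, d in zip(outputs[bad], d1)]
    return outputs, bad
```

Calling `protected_batch_dft(batch, fault=(2, 3, 5 + 1j))` on a batch of four length-8 signals reports signal 2 as corrupted and restores its spectrum from the checksum difference alone, without recomputing that signal's transform. The sketch does not protect the checksum computations themselves, and a production scheme would need a tolerance scaled to the data's magnitude and precision.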