Long-sequence state-space models (SSMs) such as Hyena and Mamba replace the quadratic complexity of self-attention with more efficient FFT and scan operations. However, modern accelerators like GPUs are poorly suited to these non-GEMM workloads due to their rigid execution models and specialization for dense matrix operations. This paper proposes architectural extensions to a baseline Reconfigurable Dataflow Unit (RDU) that efficiently support FFT-based and scan-based SSMs. By introducing lightweight interconnect enhancements within compute tiles, the extended RDU enables spatial mapping of FFT and scan dataflows with less than 1% area and power overhead. The resulting architecture achieves a 5.95X speedup over the GPU and a 1.95X speedup over the baseline RDU for Hyena, and speedups of 2.12X and 1.75X over the GPU and baseline RDU, respectively, for Mamba.
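To make the two non-GEMM primitives concrete, the sketch below (an illustrative NumPy example, not code from the paper) shows the FFT-based long convolution that Hyena-style models use in place of O(L²) attention, and the linear-recurrence scan that underlies Mamba-style selective SSMs. Function names and the scalar recurrence form are assumptions for illustration.

```python
import numpy as np

def fft_long_conv(u, k):
    """Causal long convolution in O(L log L) via FFT, the primitive
    FFT-based SSMs such as Hyena substitute for O(L^2) attention.
    u: input sequence of length L, k: implicit filter of length L."""
    L = len(u)
    n = 2 * L  # zero-pad so circular convolution equals linear convolution
    U = np.fft.rfft(u, n=n)
    K = np.fft.rfft(k, n=n)
    return np.fft.irfft(U * K, n=n)[:L]

def sequential_scan(a, b):
    """Reference linear recurrence x_t = a_t * x_{t-1} + b_t, the scan
    primitive underlying Mamba-style SSMs (shown here sequentially;
    hardware would exploit an associative parallel scan)."""
    x = np.zeros_like(b)
    acc = 0.0
    for t in range(len(b)):
        acc = a[t] * acc + b[t]
        x[t] = acc
    return x
```

Neither kernel reduces to a dense matrix multiply, which is why accelerators specialized for GEMM execute them poorly.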