RWKV is a modern RNN architecture that approaches the performance of Transformers while processing long contexts at linear memory cost. However, its sequential computation pattern struggles to leverage GPU parallelism efficiently, which leads to low compute resource utilization. Furthermore, frequent off-chip weight accesses create a memory bottleneck. To address these challenges, we propose HFRWKV, an FPGA-based hardware accelerator specifically designed for RWKV. Within the matrix operation module, we propose a novel hardware-friendly hybrid-precision quantization strategy, which enhances performance while maintaining acceptable accuracy. For complex operations such as exponentiation and division, we introduce reusable architectures combined with lookup tables or piecewise linear approximation, algorithmically refined to balance precision against hardware resource consumption. On this foundation, we adopt a fully on-chip computing system that integrates a parallel matrix-vector processing array with an efficient pipeline architecture. Through computation reordering and chunked double buffering, the system effectively eliminates data transfer bottlenecks and improves overall throughput. We implement HFRWKV on the Alveo U50 and U280 platforms. Experimental results show that, compared to a CPU, HFRWKV achieves a throughput improvement of 63.48$\times$ and an energy efficiency improvement of 139.17$\times$. Compared to GPUs, it achieves a throughput improvement of 32.33$\times$ and an energy efficiency improvement of 171.36$\times$.
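To make the idea of approximating exponentiation with a lookup table and piecewise linear segments concrete, the following is a minimal floating-point sketch. It is not the HFRWKV design: the range reduction $e^x = 2^k \cdot 2^f$, the 16-segment table, and the names `build_table`, `pwl_exp`, and `max_rel_error` are all assumptions chosen for illustration, and a hardware version would use fixed-point arithmetic rather than Python floats.

```python
import math

SEGMENTS = 16  # hypothetical table size; more segments trade memory for accuracy


def build_table(segments=SEGMENTS):
    """Precompute (slope, intercept) pairs approximating 2**f on [0, 1)."""
    table = []
    for i in range(segments):
        f0 = i / segments
        f1 = (i + 1) / segments
        y0 = 2.0 ** f0
        y1 = 2.0 ** f1
        slope = (y1 - y0) / (f1 - f0)   # chord through the segment endpoints
        intercept = y0 - slope * f0
        table.append((slope, intercept))
    return table


TABLE = build_table()


def pwl_exp(x, table=TABLE):
    """Approximate exp(x) as 2**k * pwl(2**f), with k integer and f in [0, 1)."""
    t = x * math.log2(math.e)           # exp(x) == 2**t
    k = math.floor(t)                   # integer part: a shift in hardware
    f = t - k                           # fractional part selects the segment
    idx = min(int(f * len(table)), len(table) - 1)
    slope, intercept = table[idx]
    return math.ldexp(slope * f + intercept, k)  # (slope*f + intercept) * 2**k


def max_rel_error(lo=-10.0, hi=0.0, steps=10001):
    """Worst relative error of pwl_exp against math.exp on a uniform grid."""
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        ref = math.exp(x)
        worst = max(worst, abs(pwl_exp(x) - ref) / ref)
    return worst
```

In this sketch the table holds only one multiply-add per segment, and the power-of-two factor reduces to a shift, which is the kind of precision-versus-resource trade-off the abstract alludes to. Calling `max_rel_error()` reports how the chosen segment count affects accuracy over a given input range.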