Gated DeltaNet (GDN) is a linear attention mechanism that replaces the growing KV cache with a fixed-size recurrent state. Hybrid LLMs such as Qwen3-Next use 75% GDN layers and achieve accuracy competitive with attention-only models. However, at batch size 1, GDN decode is memory-bound on GPUs because the full recurrent state must be round-tripped through HBM for every token. We show that this bottleneck is architectural, not algorithmic: all subquadratic sequence models exhibit arithmetic intensities below 1 FLOP/B at decode time, making them more memory-bound than standard Transformers. We present an FPGA accelerator that eliminates this bottleneck by holding the full 2 MB recurrent state persistently in on-chip BRAM, converting the workload from memory-bound to compute-bound. Our design fuses the GDN recurrence into a five-phase pipelined datapath that performs only one read pass and one write pass over each state matrix per token, exploits Grouped Value Attention for paired-head parallelism, and overlaps preparation, computation, and output storage via dataflow pipelining. We explore four design points on an AMD Alveo U55C using Vitis HLS, varying head-level parallelism from 2 to 16 value heads per iteration. Our fastest configuration achieves 63 $\mu$s per token, 4.5$\times$ faster than the GPU reference on an NVIDIA H100 PCIe. Post-implementation power analysis reports 9.96 W of on-chip power, yielding up to 60$\times$ greater energy efficiency per decoded token.
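To make the "one read pass, one write pass" claim concrete, the following NumPy sketch shows one way a single-head gated delta rule decode step can be arranged so that the old state is consumed once and the new state is produced once. It is an illustration under stated assumptions, not the accelerator's five-phase datapath: the update rule is the commonly cited gated delta rule, $S_t = \alpha_t S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$ with $o_t = S_t q_t$, and the dimensions (32 value heads, a 128$\times$128 FP32 state per head) are assumed values chosen because they give a 2 MiB state. The function names and the FLOP counting convention are hypothetical.

```python
import numpy as np

# Assumed dimensions (not given in the abstract), chosen so the FP32 state is 2 MiB.
NUM_V_HEADS = 32
D_K = 128
D_V = 128
BYTES_PER_ELEM = 4  # FP32 (assumption)


def gdn_decode_step(S, q, k, v, alpha, beta):
    """Fused single-head decode step (illustrative sketch).

    S: (D_V, D_K) recurrent state; q, k: (D_K,); v: (D_V,);
    alpha, beta: scalar decay and write-strength gates.
    Returns (S_new, o).
    """
    # Read pass: both products depend only on the old state, so a hardware
    # datapath could compute them in one sweep over S (NumPy issues two calls).
    u = S @ k
    w = S @ q
    # Small vector work that needs no further access to S.
    delta = beta * (v - alpha * u)
    o = alpha * w + delta * (k @ q)
    # Write pass: decay plus rank-1 update produces the new state.
    S_new = alpha * S + np.outer(delta, k)
    return S_new, o


def gdn_decode_step_reference(S, q, k, v, alpha, beta):
    """Direct transcription of the assumed update rule, used only for checking."""
    I = np.eye(S.shape[1])
    S_new = alpha * S @ (I - beta * np.outer(k, k)) + beta * np.outer(v, k)
    return S_new, S_new @ q


def state_bytes():
    """Total recurrent state size for one layer under the assumed dimensions."""
    return NUM_V_HEADS * D_V * D_K * BYTES_PER_ELEM


def rough_arithmetic_intensity():
    """Rough FLOP/byte estimate for one head when S round-trips through memory.

    Counting convention (assumption): two matrix-vector products at 2 FLOPs
    per element each, plus 3 FLOPs per element for the decay and rank-1 update.
    """
    flops = (2 + 2 + 3) * D_V * D_K
    traffic = 2 * D_V * D_K * BYTES_PER_ELEM  # one read and one write of S
    return flops / traffic
```

Under these assumptions the estimate comes out below 1 FLOP/B, in line with the memory-bound behavior described above; holding `S` on chip removes the `traffic` term rather than changing the arithmetic.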