The need for long-context reasoning has led to alternative neural network architectures besides Transformers and self-attention; a popular example is Hyena, which employs causal 1D convolutions implemented with FFTs. Long convolutions enable efficient global context mixing, but their intermediate results exceed the 2-3 MB Block RAM capacity of FPGAs. We present a chunked FFT convolution approach that enables convolutions of 450K-length sequences with 450K-length filters on an Alveo U200 FPGA with 2.8 MB of BRAM, using chunking and overlap-add reconstruction. We find that throughput scales proportionally with chunk size and degrades by only 7% for our longest sequences, demonstrating that careful memory management enables deployment of long-context primitives on edge FPGAs without sacrificing performance.
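As a concrete illustration of the scheme the abstract describes, the following is a minimal NumPy sketch of chunked FFT convolution with overlap-add reconstruction. The function name, the default chunk size, and the splitting of both the sequence and the filter into chunks are illustrative assumptions for CPU verification, not the U200 implementation itself.

```python
import numpy as np

def chunked_fft_conv(x, h, chunk=4096):
    """Linear convolution of x and h via chunked FFTs with
    overlap-add reconstruction (illustrative sketch).

    Both the sequence and the filter are split into chunk-sized
    pieces so every FFT buffer (size 2*chunk) stays small enough
    for on-chip memory; each partial product is accumulated into
    the output at its proper offset.
    """
    n_out = len(x) + len(h) - 1
    y = np.zeros(n_out)
    fft_len = 2 * chunk  # >= 2*chunk - 1, so no circular wraparound

    # Transform the filter chunks once; they are reused for every
    # sequence chunk.
    h_ffts = [np.fft.rfft(h[j:j + chunk], fft_len)
              for j in range(0, len(h), chunk)]

    for i in range(0, len(x), chunk):
        X = np.fft.rfft(x[i:i + chunk], fft_len)
        for j, H in enumerate(h_ffts):
            # Partial linear convolution of one (sequence chunk,
            # filter chunk) pair; only the first 2*chunk - 1
            # samples carry signal.
            seg = np.fft.irfft(X * H, fft_len)[:fft_len - 1]
            off = i + j * chunk  # absolute output offset of this pair
            end = min(off + len(seg), n_out)
            y[off:end] += seg[:end - off]
    return y
```

A quick check against np.convolve on short random inputs confirms that the overlap-add reconstruction recovers the full linear convolution:

```python
rng = np.random.default_rng(0)
x, h = rng.standard_normal(10_000), rng.standard_normal(10_000)
assert np.allclose(chunked_fft_conv(x, h, chunk=1024), np.convolve(x, h))
```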