The rapid development of large language models (LLMs) has greatly enhanced everyday applications. While many FPGA-based accelerators, with their flexibility for fine-grained data control, have exhibited superior speed and energy efficiency compared to GPUs, recent GPU-specific optimizations have diminished this advantage. When limited to arithmetic-based computation, FPGAs often underperform GPUs due to their comparatively fewer computational resources. To address this challenge, we exploit a key advantage of FPGAs over GPUs: abundant distributed on-chip memory embedded among the computational units. We believe that shifting LLM inference from arithmetic-based to memory-based computation through table lookups can improve efficiency on FPGAs enough to compete with GPUs. However, existing methods are either inefficient or unable to scale to and deploy language models, due to limitations in their algorithm and architecture designs. This paper introduces \textbf{LUT-LLM}, the first FPGA accelerator that deploys a 1B+ language model with memory-based computation, leveraging vector quantization. We construct a performance model, evaluate multiple quantization schemes, and identify activation-weight vector co-quantization as the most effective approach. To support this scheme, LUT-LLM features (1) bandwidth-aware parallel centroid search to reduce decoding latency, (2) efficient 2D table lookups, and (3) a spatial-temporal hybrid design that reduces data caching for higher-throughput table lookups. We develop a training recipe that converts existing models to support table lookups with high accuracy, and prototype LUT-LLM for the Qwen 3 1.7B model on the AMD V80 FPGA, reducing arithmetic operations by $4\times$ and achieving $1.10\sim3.29\times$ faster generation and $3.05\sim6.60\times$ higher energy efficiency than GPUs.