We present FlexLLM, a composable High-Level Synthesis (HLS) library for the rapid development of domain-specific LLM accelerators. FlexLLM exposes key architectural degrees of freedom for stage-customized inference, enabling hybrid designs that tailor temporal reuse and spatial dataflow separately for the prefill and decode stages, and provides a comprehensive quantization suite to support accurate low-bit deployment. Using FlexLLM, we build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. The system includes: (1) a stage-customized accelerator with hardware-efficient quantization (12.68 WikiText-2 PPL) that surpasses the SpinQuant baseline, and (2) a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing. On the AMD U280 FPGA (16nm), the accelerator achieves a 1.29$\times$ end-to-end speedup, 1.64$\times$ higher decode throughput, and 3.14$\times$ better energy efficiency than an NVIDIA A100 GPU (7nm) running BF16 inference; projected results on the V80 FPGA (7nm) reach 4.71$\times$, 6.55$\times$, and 4.13$\times$, respectively. In long-context scenarios, integrating the HMT plug-in reduces prefill latency by 23.23$\times$ and extends the context window by 64$\times$, delivering 1.10$\times$/4.86$\times$ lower end-to-end latency and 5.21$\times$/6.27$\times$ higher energy efficiency on the U280/V80 compared to the A100 baseline. FlexLLM thus bridges algorithmic innovation in LLM inference and high-performance accelerators with minimal manual effort.