SkipOPU: An FPGA-based Overlay Processor for Large Language Models with Dynamically Allocated Computation

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their inference efficiency remains a critical bottleneck due to rapidly growing parameter counts. Recent advances in dynamic computation allocation address this challenge by exploiting the highly uneven contributions of different tokens and layers, enabling selective execution that significantly reduces redundant computation while preserving model accuracy. However, existing hardware platforms and accelerators are primarily optimized for uniform, static execution, which limits their ability to support such dynamic inference patterns efficiently. In this work, we propose SkipOPU, an FPGA-based overlay processor that flexibly allocates computation across tokens and layers at run time through a lightweight routing mechanism. First, we decouple reduction operations from element-wise computation in nonlinear modules and perform the reductions incrementally, which allows both stages to be fused with adjacent linear operations (router or matrix multiplication) for effective latency hiding. Second, motivated by the asymmetric sensitivity of activations and weights to numerical precision, we design a PE array that efficiently supports float-fixed hybrid execution, and we introduce a novel DSP overpacking technique to maximize hardware utilization while minimizing resource overhead. Finally, we develop a proactive on-chip KV history buffer that exploits the cross-layer KV invariance of pruned tokens, eliminating irregular HBM accesses during decoding and supplementing off-chip bandwidth through high-locality on-chip reuse. Experimental results demonstrate that SkipOPU on an AMD U280 FPGA outperforms GPU and other FPGA-based accelerators by 1.23x-3.83x in bandwidth efficiency for LLM inference with dynamic computation allocation, and reduces KV storage overhead by up to 25.4% across varying sequence lengths.
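The first contribution, separating the reduction from the element-wise step of a nonlinear module, can be illustrated with a small software sketch. The abstract does not say which nonlinear modules are involved or how the hardware schedules them, so the choice of RMSNorm, the tile layout, and every function name below are assumptions made purely for illustration. The reduction (a running sum of squares) is accumulated tile by tile, as it could be while an upstream operation is still producing tiles; the element-wise scaling runs afterwards and could in principle be folded into whatever consumes the values next.

```python
import math


def incremental_sum_of_squares(tiles):
    """Reduction stage: accumulate sum(x^2) one tile at a time.

    In hardware this accumulation could overlap with the operation that
    produces the tiles; here it is a plain sequential loop.
    """
    acc = 0.0
    count = 0
    for tile in tiles:
        acc += sum(v * v for v in tile)
        count += len(tile)
    return acc, count


def rmsnorm_elementwise(tiles, acc, count, gain=1.0, eps=1e-6):
    """Element-wise stage: scale every value using the finished reduction."""
    inv_rms = 1.0 / math.sqrt(acc / count + eps)
    out = []
    for tile in tiles:
        out.extend(v * inv_rms * gain for v in tile)
    return out


def rmsnorm_reference(x, gain=1.0, eps=1e-6):
    """Unfused reference: reduce over the whole vector, then scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v * gain / math.sqrt(mean_sq + eps) for v in x]


if __name__ == "__main__":
    tiles = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical activation tiles
    acc, count = incremental_sum_of_squares(tiles)
    print(rmsnorm_elementwise(tiles, acc, count))
```

Because the reduction only needs a running accumulator, it never requires the full vector to be buffered, which is what makes this kind of split compatible with fusing each stage into a neighbouring operation.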


