The attention layer, a core component of Transformer-based LLMs, exposes inefficiencies in current GPU systems due to its low operational intensity and the substantial memory requirements of KV caches. We propose the High-bandwidth Processing Unit (HPU), a memory-intensive co-processor that enhances GPU resource utilization during large-batch LLM inference. By offloading memory-bound operations to the HPU, the GPU can focus on compute-intensive tasks, increasing overall efficiency. Moreover, as an add-on card, the HPU scales out to accommodate the surging memory demands driven by large batch sizes and extended sequence lengths. In this paper, we present an HPU prototype implemented with PCIe-based FPGA cards mounted on a GPU system. Our novel GPU-HPU heterogeneous system demonstrates up to 4.1x performance gains and 4.6x energy efficiency improvements over a GPU-only system, providing scalability without increasing the number of GPUs.
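The "low operational intensity" claim follows from a simple roofline-style argument: during decode, each generated token must stream the entire KV cache from memory while performing only a few FLOPs per element read. The back-of-envelope sketch below is our own illustration of this point, not taken from the paper; the head dimension, sequence length, and FP16 KV-cache assumption are hypothetical.

```python
# Back-of-envelope arithmetic intensity of decode-phase attention for one head.
# Illustrative sketch only; the dimensions and dtype are assumptions, not values
# reported by the paper.

def decode_attention_intensity(seq_len: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte of KV-cache traffic for a single-token attention step."""
    # q @ K^T and softmax(scores) @ V each cost roughly 2 * seq_len * head_dim FLOPs.
    flops = 4 * seq_len * head_dim
    # Both the K and V caches for this head must be streamed from memory.
    bytes_moved = 2 * seq_len * head_dim * bytes_per_elem
    return flops / bytes_moved

if __name__ == "__main__":
    # Example: 4k-token context, 128-dim heads, FP16 KV cache.
    print(f"{decode_attention_intensity(4096, 128):.2f} FLOP/byte")  # ~1 FLOP/byte
```

At roughly 1 FLOP per byte of KV-cache traffic, decode attention sits far below the ridge point of modern GPUs, whose compute-to-bandwidth ratios are on the order of hundreds of FLOPs per byte, which is why such operations are memory-bound and candidates for offloading to a high-bandwidth co-processor.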