The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen
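To make the 4-bit compression step concrete, the following is a minimal sketch of group-wise min-max quantization in plain Python. The abstract does not say which quantization scheme FlexGen uses, so the scheme, the group size, and the function names `quantize_4bit` and `dequantize_4bit` are all assumptions made for illustration and are not taken from the FlexGen codebase.

```python
from typing import List, Tuple

# Hypothetical helper names; this is an illustrative sketch, not FlexGen's code.
Group = Tuple[float, float, List[int]]  # (group minimum, scale, 4-bit codes)


def quantize_4bit(values: List[float], group_size: int = 4) -> List[Group]:
    """Map each group of values to integer codes in 0..15 (4 bits each)."""
    groups: List[Group] = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        # 4 bits give 16 levels, so the range is split into 15 steps.
        # A constant group has zero range; fall back to a scale of 1.0.
        scale = (hi - lo) / 15 or 1.0
        codes = [min(15, max(0, round((v - lo) / scale))) for v in chunk]
        groups.append((lo, scale, codes))
    return groups


def dequantize_4bit(groups: List[Group]) -> List[float]:
    """Reconstruct approximate values from per-group minimum, scale, and codes."""
    return [lo + scale * c for lo, scale, codes in groups for c in codes]


if __name__ == "__main__":
    weights = [0.0, 0.1, -0.5, 1.0, 2.0, 2.0, 2.0, 2.0]
    packed = quantize_4bit(weights, group_size=4)
    restored = dequantize_4bit(packed)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max reconstruction error: {worst:.4f}")
```

Under this scheme each value is stored in 4 bits plus a small per-group overhead for the minimum and scale, and the reconstruction error per element is bounded by about half the group's scale. Larger groups reduce the overhead but widen the range each group must cover, which is the usual trade-off in such schemes.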