The high computational and memory requirements of large language model (LLM) inference have traditionally made it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Using a linear programming optimizer, it searches for efficient patterns to store and access tensors. FlexGen further compresses the model weights and the attention cache to 4 bits with negligible accuracy loss. These techniques give FlexGen a larger space of batch size choices and thus significantly increase its maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput than state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen.
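To make the 4-bit compression step concrete, the following is a minimal sketch of one way a tensor can be reduced to 4 bits per value. The abstract does not specify the quantization scheme, so the group-wise min-max approach, the group size, and all function names here (`quantize_group`, `dequantize_group`, `pack_nibbles`, `unpack_nibbles`) are illustrative assumptions rather than FlexGen's actual implementation.

```python
from typing import List, Tuple

LEVELS = 15  # 4 bits give integer codes 0..15


def quantize_group(values: List[float]) -> Tuple[List[int], float, float]:
    """Map one group of floats to 4-bit codes using its min and max.

    Assumed scheme (not taken from the source): asymmetric min-max
    quantization applied independently to each group.
    """
    lo, hi = min(values), max(values)
    # Avoid division by zero when every value in the group is identical.
    scale = (hi - lo) / LEVELS if hi > lo else 1.0
    codes = [min(LEVELS, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def dequantize_group(codes: List[int], lo: float, scale: float) -> List[float]:
    """Recover approximate floats from 4-bit codes."""
    return [lo + c * scale for c in codes]


def pack_nibbles(codes: List[int]) -> bytes:
    """Store two 4-bit codes per byte (high nibble first)."""
    padded = codes + [0] if len(codes) % 2 else codes
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


def unpack_nibbles(packed: bytes, count: int) -> List[int]:
    """Inverse of pack_nibbles; `count` drops the padding nibble if any."""
    out: List[int] = []
    for b in packed:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out[:count]


if __name__ == "__main__":
    group = [-0.8, 0.1, 0.35, 1.2, 0.0]  # hypothetical weight values
    codes, lo, scale = quantize_group(group)
    packed = pack_nibbles(codes)
    restored = dequantize_group(unpack_nibbles(packed, len(codes)), lo, scale)
    print(len(packed), "bytes for", len(group), "values")
    print([round(x, 3) for x in restored])
```

Under this scheme each value costs half a byte plus a small per-group overhead for `lo` and `scale`, and the reconstruction error per value is bounded by half the group's step size. Smaller memory per tensor is what allows more of the batch to fit on a memory-limited device.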