Recent breakthroughs in large language models (LLMs) have demonstrated impressive performance on a variety of tasks. The immense size of LLMs, however, makes them very resource-intensive and costly to run. Although these models are now served largely on uniform, high-end GPUs, a heterogeneous cluster that mixes the available high- and low-capacity GPUs could substantially reduce the serving cost. Designs that support efficient LLM serving on heterogeneous clusters are lacking: current solutions focus on model partitioning and uniform compression across homogeneous devices. This paper proposes LLM-PQ, a system that advocates adaptive model quantization and phase-aware partitioning to improve LLM serving efficiency on heterogeneous GPU clusters. Using an efficient algorithm, we jointly decide on mixed-precision model quantization, phase-aware model partitioning, and micro-batch sizing in distributed LLM serving, to greatly enhance inference throughput while meeting user-specified model quality targets. Extensive experiments on production inference workloads in 11 different clusters demonstrate that LLM-PQ achieves up to 2.88x (2.26x on average) higher inference throughput than state-of-the-art works.
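To make the idea of combining model partitioning with mixed-precision quantization on unequal GPUs more concrete, here is a minimal Python sketch. It is not the LLM-PQ algorithm: it is a toy greedy heuristic written for illustration, and every name in it (`split_layers`, `pick_bits`, `plan`, `weight_fraction`, the candidate bit-widths, and the example GPU sizes) is a hypothetical assumption rather than something taken from the paper. It also ignores the phase-aware and micro-batch-sizing parts of the problem and uses "as many high-precision layers as fit in memory" as a crude stand-in for a model quality target.

```python
from dataclasses import dataclass

BITS = (4, 8, 16)  # candidate weight precisions, lowest first (assumed set)


@dataclass
class Stage:
    gpu: str
    num_layers: int
    bits: list  # chosen bit-width for each layer placed on this GPU


def split_layers(num_layers, mem_gb):
    """Split layers across GPUs roughly in proportion to their memory."""
    total = sum(mem_gb)
    counts = [int(num_layers * m // total) for m in mem_gb]
    # Hand the leftover layers to the largest devices first.
    order = sorted(range(len(mem_gb)), key=lambda i: -mem_gb[i])
    for i in order[: num_layers - sum(counts)]:
        counts[i] += 1
    return counts


def pick_bits(n_layers, params_b, budget_gb):
    """Greedily raise per-layer precision while the weights still fit.

    params_b is the parameter count of one layer in billions, so a layer
    stored at b bits occupies params_b * b / 8 GB.
    """
    def size_gb(bits):
        return sum(params_b * b / 8 for b in bits)

    bits = [BITS[0]] * n_layers
    if size_gb(bits) > budget_gb:
        raise ValueError("stage does not fit even at the lowest precision")
    for target in BITS[1:]:          # try 8-bit upgrades, then 16-bit
        for i in range(n_layers):
            trial = bits[:i] + [target] + bits[i + 1:]
            if size_gb(trial) <= budget_gb:
                bits = trial
    return bits


def plan(num_layers, params_b, gpus, weight_fraction=0.5):
    """gpus maps a device name to its memory in GB.

    weight_fraction is an assumed share of memory reserved for weights;
    the rest is left for activations and other runtime state.
    """
    names = list(gpus)
    counts = split_layers(num_layers, [gpus[n] for n in names])
    return [
        Stage(n, c, pick_bits(c, params_b, gpus[n] * weight_fraction))
        for n, c in zip(names, counts)
    ]


# Toy setup: 24 layers of 1B parameters each, one large and one small GPU.
for stage in plan(24, 1.0, {"big": 40, "small": 16}):
    print(stage.gpu, stage.num_layers, stage.bits)
```

The sketch only shows why the two decisions interact: the number of layers a device receives determines how much precision each of those layers can afford, so a smaller GPU ends up with fewer layers, lower precision, or both. A realistic planner would additionally need measured per-device latency and a quality metric, which this example does not model.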