Serving generative inference of large language models is a crucial component of contemporary AI applications. This paper focuses on deploying such services in a heterogeneous, cross-datacenter setting to mitigate the substantial inference costs typically associated with a single centralized datacenter. Towards this end, we propose HexGen, a flexible distributed inference engine that uniquely supports the asymmetric partitioning of generative inference computation over both tensor model parallelism and pipeline parallelism, and allows for effective deployment across diverse GPUs interconnected by a fully heterogeneous network. We further propose a sophisticated scheduling algorithm grounded in constrained optimization that adaptively assigns asymmetric inference computation across the GPUs to fulfill inference requests while maintaining acceptable latency levels. We conduct an extensive evaluation of HexGen by serving the state-of-the-art Llama-2 (70B) model. The results suggest that, given the same budget, HexGen can choose to achieve latency deadlines up to 2.3 times lower, or tolerate request rates up to 4 times higher, than the homogeneous baseline.
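To make the notion of an asymmetric partition concrete, the following is a minimal sketch of how such a deployment plan might be represented and costed. All names here (StagePlan, stage_latency, gpu_tflops) are hypothetical illustrations, not HexGen's actual API: the idea is that each pipeline stage may use a different number of GPUs, a different tensor-parallel degree, and an uneven slice of layers, and a constrained-optimization scheduler searches over such plans.

```python
from dataclasses import dataclass

@dataclass
class StagePlan:
    """One pipeline stage: the GPUs it runs on and its tensor-parallel degree.

    Hypothetical structure for illustration only; not HexGen's actual API.
    """
    gpu_ids: list[int]        # GPUs assigned to this stage
    tp_degree: int            # tensor model parallelism degree within the stage
    layers: tuple[int, int]   # [start, end) range of transformer layers

# An *asymmetric* plan: stages differ in GPU count, tensor-parallel degree,
# and layer split -- unlike a homogeneous setup where every stage is identical.
plan = [
    StagePlan(gpu_ids=[0, 1, 2, 3], tp_degree=4, layers=(0, 40)),   # fast GPUs, large slice
    StagePlan(gpu_ids=[4, 5],       tp_degree=2, layers=(40, 64)),  # slower GPUs
    StagePlan(gpu_ids=[6],          tp_degree=1, layers=(64, 80)),  # single remote GPU
]

def stage_latency(stage: StagePlan, gpu_tflops: dict[int, float],
                  flops_per_layer: float) -> float:
    """Rough per-stage compute-time estimate: work is divided across the
    tensor-parallel group and bounded by its slowest member."""
    n_layers = stage.layers[1] - stage.layers[0]
    slowest = min(gpu_tflops[g] for g in stage.gpu_ids)
    return n_layers * flops_per_layer / (stage.tp_degree * slowest)

# A scheduler in this style would enumerate or optimize over candidate plans,
# keeping those whose summed stage latencies (plus communication, omitted here)
# meet the latency deadline at minimal cost.
```

In practice the search must also account for heterogeneous inter-GPU bandwidth and memory capacity, which is what motivates framing the assignment as a constrained optimization problem rather than a uniform split.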