Serving foundation model inference is a pivotal component of contemporary AI applications, and this service is usually hosted in a centralized data center on a group of homogeneous high-performance GPUs. In this paper, we explore how to deploy such a service in an environment that is heterogeneous in both computation capacity and network connection, as an alternative that reduces the high inference cost. We propose HexGen, a distributed inference engine that supports asymmetric partitioning of the inference computation along both tensor model parallelism and pipeline parallelism. HexGen can be deployed on a set of different GPUs connected by a fully heterogeneous network, where the key technical contribution is a scheduling algorithm that allocates the asymmetric inference tasklets among these GPUs. We formulate the scheduling problem as a constrained optimization problem and further propose an efficient evolutionary algorithm to find the optimal allocation strategy. We conduct an extensive empirical study to evaluate the efficiency of HexGen by serving the state-of-the-art Llama-2 (70B) model. The experimental results suggest that, given the same budget, HexGen can either achieve up to 2.3 times lower latency deadlines or tolerate up to 4 times higher request rates compared with the homogeneous baseline. Our implementation is available at https://github.com/Relaxed-System-Lab/HexGen.