Serving generative inference for large-scale foundation models is a crucial component of contemporary AI applications. This paper focuses on deploying such services in a heterogeneous and decentralized setting to mitigate the substantial inference costs typically associated with centralized data centers. To this end, we propose HexGen, a flexible distributed inference engine that uniquely supports asymmetric partitioning of generative inference computation across both tensor model parallelism and pipeline parallelism, and that can be deployed effectively across diverse GPUs interconnected by a fully heterogeneous network. We further propose a scheduling algorithm grounded in constrained optimization that adaptively assigns asymmetric inference computation across the GPUs to fulfill inference requests while maintaining acceptable latency. We conduct an extensive evaluation of HexGen's efficiency by serving the state-of-the-art Llama-2 (70B) model. The results suggest that, given the same budget, HexGen can either meet latency deadlines up to 2.3 times lower or sustain request rates up to 4 times higher than the homogeneous baseline.
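To make the notion of an asymmetric partition concrete, the following is a minimal sketch, not HexGen's actual interface. It assumes a layout is a list of pipeline stages, each owning a contiguous slice of layers sharded over its own group of GPUs, so that stages may differ in both tensor-parallel degree and layer count. The names `Stage`, `validate_layout`, and `is_asymmetric` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline stage (hypothetical type): a slice of layers on a GPU group."""
    gpu_ids: tuple   # GPUs in this stage; their count is the tensor-parallel degree
    num_layers: int  # number of transformer layers assigned to this stage

    @property
    def tp_degree(self) -> int:
        return len(self.gpu_ids)


def validate_layout(stages, total_layers):
    """Return True if the layout covers every layer once and reuses no GPU."""
    if sum(s.num_layers for s in stages) != total_layers:
        return False
    seen = set()
    for s in stages:
        if s.num_layers <= 0 or s.tp_degree == 0:
            return False
        if seen & set(s.gpu_ids):  # a GPU may belong to only one stage
            return False
        seen |= set(s.gpu_ids)
    return True


def is_asymmetric(stages):
    """Return True if stages differ in tensor-parallel degree or layer count."""
    return len({(s.tp_degree, s.num_layers) for s in stages}) > 1


# Assumed example: an 80-layer model split over three unequal stages.
layout = [
    Stage(gpu_ids=(0, 1, 2, 3), num_layers=40),  # 4-way tensor parallel
    Stage(gpu_ids=(4, 5), num_layers=24),        # 2-way tensor parallel
    Stage(gpu_ids=(6,), num_layers=16),          # single GPU
]
```

A symmetric layout would give every stage the same GPU count and the same number of layers. In this sketch, a scheduler of the kind the abstract describes would search over layouts that pass `validate_layout`, subject to memory and latency constraints that are not modeled here.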