Recent innovations in generative large language models (LLMs) have made their applications and use-cases ubiquitous. This has led to large-scale deployments of these models, using complex, expensive, and power-hungry AI accelerators, most commonly GPUs. These developments make LLM inference efficiency an important challenge. Based on our extensive characterization, we find that there are two main phases during an LLM inference request: a compute-intensive prompt computation, and a memory-intensive token generation, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Specifically, unlike the compute-intensive prompt computation phase, the token generation phase does not require the compute capability of the latest GPUs, and can be run at lower power and cost. With Splitwise, we propose splitting the two phases of an LLM inference request onto separate machines. This allows us to use hardware that is well-suited for each phase, and to provision resources independently per phase. However, splitting an inference request across machines requires transferring state from the machine running prompt computation to the machine generating tokens. We implement and optimize this state transfer using the fast back-plane interconnects available in today's GPU clusters. We use the Splitwise technique to design LLM inference clusters using the same or different types of machines for the prompt computation and token generation phases. Our clusters are optimized for three key objectives: throughput, cost, and power. In particular, we show that we can achieve 1.4x higher throughput at 20% lower cost than current designs. Alternatively, we can achieve 2.35x more throughput with the same cost and power budgets.
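To make the two-phase structure described above concrete, the following is a minimal toy sketch in Python. It is an illustrative assumption, not the Splitwise implementation: `prefill` stands in for the batched, compute-intensive prompt computation that produces the KV cache (the state Splitwise transfers between machines), and `decode` stands in for the memory-intensive, one-token-at-a-time generation phase that repeatedly reads and extends that cache. All function names and the arithmetic inside them are invented placeholders for real transformer computation.

```python
# Toy sketch of the two LLM inference phases (prompt computation vs.
# token generation). Illustrative only; not the Splitwise code.

def prefill(prompt_tokens):
    """Prompt computation: process all prompt tokens in one batch,
    producing the KV cache -- the state that, in a split design,
    would be shipped over the cluster interconnect to a decode machine."""
    # Stand-in for per-token key/value pairs computed by attention layers.
    kv_cache = [(t, t * 2) for t in prompt_tokens]
    # Stand-in for the first generated token.
    first_token = sum(k for k, _ in kv_cache) % 100
    return kv_cache, first_token

def decode(kv_cache, first_token, n_tokens):
    """Token generation: one token per step, each step reading the
    entire KV cache (memory-bound) and appending one new entry."""
    out = [first_token]
    for _ in range(n_tokens - 1):
        nxt = (sum(k for k, _ in kv_cache) + out[-1]) % 100
        kv_cache.append((nxt, nxt * 2))
        out.append(nxt)
    return out

prompt = [3, 1, 4, 1, 5]
cache, tok0 = prefill(prompt)       # compute-heavy phase, one machine
generated = decode(cache, tok0, n_tokens=4)  # memory-heavy phase, another
print(generated)
```

In a split deployment, the `kv_cache` returned by `prefill` is exactly the state that must cross machines, which is why the paper optimizes that transfer over fast back-plane interconnects.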