Splitwise: Efficient generative LLM inference using phase splitting

Recent innovations in generative large language models (LLMs) have made their applications and use cases ubiquitous. This has led to large-scale deployments of these models, using complex, expensive, and power-hungry AI accelerators, most commonly GPUs. These developments make LLM inference efficiency an important challenge. Based on our extensive characterization, we find that there are two main phases during an LLM inference request: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Specifically, unlike compute-intensive prompt computation phases, token generation phases do not require the compute capability of the latest GPUs, and can be run with lower power and cost. With Splitwise, we propose splitting the two phases of an LLM inference request onto separate machines. This allows us to use hardware that is well-suited for each phase, and to provision resources independently per phase. However, splitting an inference request across machines requires state transfer from the machine running prompt computation to the machine generating tokens. We implement and optimize this state transfer using the fast back-plane interconnects available in today's GPU clusters. We use the Splitwise technique to design LLM inference clusters using the same or different types of machines for the prompt computation and token generation phases. Our clusters are optimized for three key objectives: throughput, cost, and power. In particular, we show that we can achieve 1.4x higher throughput at 20% lower cost than current designs. Alternatively, we can achieve 2.35x more throughput with the same cost and power budgets.
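
To make the phase split concrete, the following is a minimal sketch of the control flow only, not the Splitwise implementation. It assumes a toy deterministic stand-in for the model (`toy_next_token`), a hypothetical `KVState` container for the per-request state, and a deep copy in place of the interconnect transfer. All names are illustrative and do not come from the paper. The sketch also assumes `max_new_tokens` is at least 1, since the prompt phase always emits the first token.

```python
import copy
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: list
    max_new_tokens: int  # assumed >= 1 in this sketch


@dataclass
class KVState:
    # Hypothetical stand-in for the per-request state built during
    # prompt computation and needed by token generation.
    context: list


def toy_next_token(context):
    # Toy deterministic "model" used only for illustration; not a real LLM.
    return (sum(context) * 31 + len(context)) % 1000


def prompt_phase(req):
    """Process the whole prompt in one pass and emit the first output token."""
    state = KVState(context=list(req.prompt_tokens))
    first = toy_next_token(state.context)
    state.context.append(first)
    return state, first


def transfer_state(state):
    """Simulate shipping the state to another machine (here, a deep copy)."""
    return copy.deepcopy(state)


def token_phase(state, first, max_new_tokens):
    """Generate the remaining tokens one at a time from the received state."""
    output = [first]
    while len(output) < max_new_tokens:
        nxt = toy_next_token(state.context)
        state.context.append(nxt)
        output.append(nxt)
    return output


def serve_split(req):
    state, first = prompt_phase(req)       # runs on the "prompt machine"
    remote_state = transfer_state(state)   # state transfer between machines
    return token_phase(remote_state, first, req.max_new_tokens)  # "token machine"


def serve_unsplit(req):
    # Baseline: both phases run on the same machine, no transfer.
    state, first = prompt_phase(req)
    return token_phase(state, first, req.max_new_tokens)
```

In this sketch, `serve_split` and `serve_unsplit` produce identical outputs for the same request, which reflects the point that splitting changes where each phase runs, not what is generated. The cost of the split is concentrated in `transfer_state`, the step the abstract describes optimizing over back-plane interconnects.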
