Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also uses the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated KVCache. The core of Mooncake is its KVCache-centric scheduler, which aims to maximize overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies, which assume that every request will be processed, Mooncake must cope with highly overloaded scenarios in which some requests cannot be served. To mitigate this, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake achieves up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's architecture enables Kimi to handle 75% more requests.
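The abstract names a prediction-based early rejection policy without giving its details. The following is only a minimal sketch of the general idea, not Mooncake's actual scheduler: the names ClusterLoad and admit, the normalized load values, and the limit of 1.0 are all hypothetical. It assumes a request is rejected before any prefill work is spent on it if either the prefill pool or the predicted decoding load would exceed its SLO limit.

```python
from dataclasses import dataclass


@dataclass
class ClusterLoad:
    """Hypothetical load snapshot; 1.0 means a pool is exactly at its SLO limit."""
    prefill: float           # current load of the prefill cluster
    decode_predicted: float  # predicted decoding load when this request's prefill would finish


def admit(load: ClusterLoad, prefill_cost: float, decode_cost: float,
          limit: float = 1.0) -> bool:
    """Return True to accept the request, False to reject it early.

    All costs and loads are illustrative, normalized numbers (assumption).
    """
    # Reject up front if the prefill cluster cannot absorb the request.
    if load.prefill + prefill_cost > limit:
        return False
    # Reject up front if the decoding cluster is predicted to be overloaded
    # by the time prefill completes; this avoids wasting prefill work on a
    # request that would miss its decoding SLO anyway.
    if load.decode_predicted + decode_cost > limit:
        return False
    return True


if __name__ == "__main__":
    print(admit(ClusterLoad(prefill=0.5, decode_predicted=0.6), 0.2, 0.2))
    print(admit(ClusterLoad(prefill=0.1, decode_predicted=0.95), 0.1, 0.1))
```

In this sketch the decision depends on the predicted decoding load rather than the load at arrival time, which is what makes the rejection "early": the check happens before the request consumes any prefill resources.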


