Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated KVCache store. The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput against meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces the challenge of highly overloaded scenarios. To mitigate this, we developed a prediction-based early-rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake achieves up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.
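The idea behind prediction-based early rejection can be sketched as follows. This is a minimal illustration, not Mooncake's actual implementation: the request fields, the load predictor, and the capacity threshold are all hypothetical stand-ins for the paper's scheduler internals.

```python
# A minimal sketch of a prediction-based early-rejection policy.
# All names (Request, predict_decode_load, admit, capacity) are
# illustrative assumptions, not Mooncake's real API.

from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    expected_output_tokens: int  # predicted, e.g. by an output-length model


def predict_decode_load(active: list[Request], candidate: Request) -> float:
    """Predict total decoding work (in token-steps) if the candidate is admitted."""
    return sum(r.expected_output_tokens for r in active) + candidate.expected_output_tokens


def admit(active: list[Request], candidate: Request, capacity: float) -> bool:
    """Reject early when predicted load would exceed capacity and thus violate SLOs."""
    return predict_decode_load(active, candidate) <= capacity


# Usage: with a capacity budget of 100 token-steps, a request whose
# predicted decode load fits is admitted; one that overshoots is rejected
# before any prefill work is wasted on it.
active = [Request(1000, 40), Request(500, 30)]
print(admit(active, Request(2000, 20), capacity=100.0))  # 40+30+20 = 90  -> True
print(admit(active, Request(2000, 50), capacity=100.0))  # 40+30+50 = 120 -> False
```

The point of rejecting at admission time, rather than after prefill, is that an overloaded system avoids spending compute on requests it will ultimately fail to serve within their SLOs.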