Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-token.
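The locality-aware policy can be illustrated with a minimal sketch: a global prompt tree (here a token-level trie) records which serving instance caches the KV entries for each prompt prefix, and the scheduler routes a new request to the instance holding the longest matched prefix. The class and method names below are hypothetical and not from the paper; this is an assumption-laden sketch of the general prefix-matching idea, not MemServe's implementation.

```python
# Illustrative sketch (hypothetical names, not the paper's code): a global
# prompt trie mapping token prefixes to the serving instances that cache
# them, so a scheduler can route requests for maximal KV-cache reuse.
class TrieNode:
    def __init__(self):
        self.children = {}      # token -> TrieNode
        self.instances = set()  # instances caching this prefix

class PromptTree:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens, instance):
        """Record that `instance` now caches KV entries for `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            node.instances.add(instance)

    def best_instance(self, tokens):
        """Return (instance, matched_len) for the longest cached prefix."""
        node, best, depth = self.root, (None, 0), 0
        for tok in tokens:
            node = node.children.get(tok)
            if node is None or not node.instances:
                break
            depth += 1
            best = (next(iter(node.instances)), depth)
        return best

tree = PromptTree()
tree.insert([1, 2, 3, 4], "instance-A")  # instance-A caches this prompt
tree.insert([1, 2, 9], "instance-B")     # instance-B caches another prompt
inst, matched = tree.best_instance([1, 2, 3, 7])
# instance-A holds the longest cached prefix [1, 2, 3] (3 tokens)
```

A real scheduler would also weigh instance load against prefix length and evict stale entries as KV caches are freed, but the trie lookup above captures the core locality-aware routing decision.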