Large language model (LLM) serving has evolved from stateless to stateful systems that rely on techniques such as context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, which calls for a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool that manages distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that improves cache reuse through a locality-aware policy based on a global prompt tree. Tests show that MemServe significantly improves job completion time and time-to-first-token.
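To make the idea of a locality-aware policy over a global prompt tree more concrete, here is a minimal sketch. It is not MemServe's actual scheduler, whose details the abstract does not give. It assumes prompts are lists of token IDs, that the scheduler records which serving instance has cached which prefix, and that a request should go to the instance with the longest cached prefix, with ties broken by lowest load. The names `PromptTreeNode`, `GlobalPromptTree`, `record`, `match_lengths`, and `route` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PromptTreeNode:
    # Children are keyed by the next token ID. `instances` holds the serving
    # instances assumed to cache KV entries for the prefix ending at this node.
    children: dict = field(default_factory=dict)
    instances: set = field(default_factory=set)


class GlobalPromptTree:
    """Hypothetical global index of cached prompt prefixes (illustration only)."""

    def __init__(self):
        self.root = PromptTreeNode()

    def record(self, tokens, instance):
        """Note that `instance` now caches every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PromptTreeNode())
            node.instances.add(instance)

    def match_lengths(self, tokens):
        """Return {instance: length of its longest cached prefix of `tokens`}."""
        lengths = {}
        node = self.root
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            for inst in node.instances:
                lengths[inst] = depth  # deeper matches overwrite shallower ones
        return lengths


def route(tree, tokens, load):
    """Pick the instance with the longest cached prefix; break ties by lowest load.

    `load` maps every candidate instance to its current load (assumed metric).
    """
    lengths = tree.match_lengths(tokens)
    return min(load, key=lambda inst: (-lengths.get(inst, 0), load[inst], inst))


if __name__ == "__main__":
    tree = GlobalPromptTree()
    tree.record([1, 2, 3, 4], "A")
    tree.record([1, 2, 9], "B")
    load = {"A": 5, "B": 1, "C": 0}
    print(route(tree, [1, 2, 3, 7], load))  # A holds the longest matching prefix
    print(route(tree, [8, 8], load))        # no cached prefix, so lowest load wins
```

A production scheduler would likely match at the granularity of KV cache blocks rather than single tokens, and would weigh cache locality against load instead of always preferring the longest match. The sketch leaves both out for brevity.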