Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-token.
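To illustrate the idea behind a global prompt tree-based locality-aware policy, the following is a minimal sketch of a token trie that routes a request to the serving instance holding the longest cached prompt prefix. All names (`PromptTree`, `best_instance`, etc.) are illustrative assumptions, not MemServe's actual API:

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # token -> child TrieNode
        self.instances = set()  # instances whose KV cache covers this prefix

class PromptTree:
    """Hypothetical global prompt trie mapping cached prefixes to instances."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens, instance):
        """Record that `instance` holds the KV cache for this token prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            node.instances.add(instance)

    def best_instance(self, tokens):
        """Return (instance, matched_len) for the longest cached prefix,
        or (None, 0) when no prefix of `tokens` is cached anywhere."""
        node, best = self.root, (None, 0)
        for depth, tok in enumerate(tokens, start=1):
            if tok not in node.children:
                break
            node = node.children[tok]
            if node.instances:
                best = (next(iter(node.instances)), depth)
        return best

# Usage: two instances cache different prompt prefixes; a new request
# is routed to the one sharing the longest prefix.
tree = PromptTree()
tree.insert([1, 2, 3, 4], "inst-A")
tree.insert([1, 2, 9], "inst-B")
inst, matched = tree.best_instance([1, 2, 3, 7])
# inst == "inst-A", matched == 3
```

A scheduler using such a structure can reuse previously computed KV entries for the matched prefix and only recompute the divergent suffix, which is the locality benefit the abstract refers to.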