Prefill and decode (PD) disaggregation separates the prompt prefill and token-by-token decode stages into distinct GPU pools and has become the dominant architecture for large-scale LLM serving in industry. However, retrieval tasks performed via vector search, such as heterogeneous RAG requests and prompt-answer caches, remain entangled with the model inference process, inflating tail latency. This motivates us to investigate how vector search should be orchestrated alongside PD disaggregation through a dedicated deployment architecture, without violating SLOs across diverse retrieval workloads. We present Trinity, a practical framework that consolidates all retrieval into a single, shared vector-search GPU pool and integrates it with PD-disaggregated LLM serving. Trinity introduces (1) a novel architecture for deploying a GPU-based vector-search service within PD disaggregation; (2) continuous batching for vector search that makes full use of GPUs under heterogeneous queries; and (3) stage-aware scheduling that preempts vector-search requests between both decode and prefill tasks.