Production LLM serving must simultaneously deliver high throughput, low latency, and sufficient context capacity under non-stationary traffic and mixed request requirements. Data parallelism (DP) maximizes throughput by running independent replicas, while tensor parallelism (TP) reduces per-request latency and pools memory for long-context inference. However, existing serving stacks typically commit to a static parallelism configuration at deployment time; adapting to bursts, priority requests, or long-context requests is therefore disruptive and slow. We present Flying Serving, a vLLM-based system that enables online DP-TP switching without restarting engine workers. Flying Serving makes reconfiguration practical by virtualizing the state that would otherwise force data movement: (i) a zero-copy Model Weights Manager that exposes TP shard views on demand, (ii) a KV Cache Adaptor that preserves request KV state across DP/TP layouts, (iii) an eagerly initialized Communicator Pool that amortizes collective setup, and (iv) a deadlock-free scheduler that coordinates safe transitions under execution skew. Across three popular LLMs and realistic serving scenarios, Flying Serving improves performance by up to $4.79\times$ under high load and $3.47\times$ under low load while supporting both latency-driven and memory-driven requests.