Production LLM serving must simultaneously deliver high throughput, low latency, and sufficient context capacity under non-stationary traffic and mixed request requirements. Data parallelism (DP) maximizes throughput by running independent replicas, while tensor parallelism (TP) reduces per-request latency and pools memory for long-context inference. However, existing serving stacks typically commit to a static parallelism configuration at deployment; adapting to bursts, priorities, or long-context requests is often disruptive and slow. We present Flying Serving, a vLLM-based system that enables online DP-TP switching without restarting engine workers. Flying Serving makes reconfiguration practical by virtualizing the state that would otherwise force data movement: (i) a zero-copy Model Weights Manager that exposes TP shard views on demand, (ii) a KV Cache Adaptor that preserves request KV state across DP/TP layouts, (iii) an eagerly initialized Communicator Pool to amortize collective setup, and (iv) a deadlock-free scheduler that coordinates safe transitions under execution skew. Across three popular LLMs and realistic serving scenarios, Flying Serving improves performance by up to $4.79\times$ under high load and $3.47\times$ under low load while supporting latency- and memory-driven requests.