Any-to-any multimodal models that jointly handle text, images, video, and audio represent a significant advance in multimodal AI. However, their complex architectures (typically combining multiple autoregressive LLMs, diffusion transformers, and other specialized components) pose substantial challenges for efficient model serving. Existing serving systems are mainly tailored to a single paradigm, such as autoregressive LLMs for text generation or diffusion transformers for visual generation. They lack support for any-to-any pipelines that involve multiple interconnected model components. As a result, developers must manually handle cross-stage interactions, leading to severe performance degradation. We present vLLM-Omni, a fully disaggregated serving system for any-to-any models. vLLM-Omni features a novel stage abstraction that enables users to decompose complex any-to-any architectures into interconnected stages represented as a graph, and a disaggregated stage execution backend that optimizes resource utilization and throughput across stages. Each stage is independently served by an LLM or diffusion engine with per-stage request batching, flexible GPU allocation, and unified inter-stage connectors for data routing. Experimental results demonstrate that vLLM-Omni reduces job completion time (JCT) by up to 91.4% compared to baseline methods. The code is publicly available at https://github.com/vllm-project/vllm-omni.
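The stage abstraction above can be illustrated with a minimal sketch. All names here (`Stage`, the engine labels, and the example text-to-image-to-audio pipeline) are hypothetical and not taken from the vLLM-Omni codebase; the sketch only shows the idea of decomposing an any-to-any model into a graph of independently served stages.

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    """One independently served component of an any-to-any pipeline.

    Hypothetical illustration of the stage abstraction: each stage is
    backed by its own engine (e.g. an autoregressive LLM or a diffusion
    transformer) and lists the downstream stages its outputs flow to.
    """
    name: str
    engine: str  # e.g. "llm" or "diffusion"; served with per-stage batching
    downstream: list["Stage"] = field(default_factory=list)

def build_pipeline() -> Stage:
    # Hypothetical any-to-any pipeline: a text LLM whose output conditions
    # a diffusion image stage, which in turn feeds an audio decoder stage.
    llm = Stage("text_lm", "llm")
    image = Stage("image_dit", "diffusion")
    audio = Stage("audio_dec", "diffusion")
    llm.downstream = [image]
    image.downstream = [audio]
    return llm

def stage_order(root: Stage) -> list[str]:
    # Walk the stage graph (assumed acyclic) to get an execution order;
    # in a real system, inter-stage connectors would route intermediate
    # data (tokens, latents) between engines along these edges.
    order, stack, seen = [], [root], set()
    while stack:
        stage = stack.pop()
        if stage.name in seen:
            continue
        seen.add(stage.name)
        order.append(stage.name)
        stack.extend(stage.downstream)
    return order
```

Because each `Stage` is a separate graph node rather than a layer inside one monolithic model, a serving system can batch requests, allocate GPUs, and scale each stage independently, which is the property the disaggregated backend exploits.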