We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model-serving frameworks were built on narrow assumptions about model structure, which makes them ill-suited to this architectural diversity. Here we present M*, a universal system for efficiently serving composite AI models. M* represents models as dataflow graphs and processes requests that span diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show that it concisely captures composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, and up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Our work thus paves the way toward more efficient serving of complex models with minimal developer effort.
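To make the idea of requests as traversals over a dataflow graph concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not M*'s actual API: the names `Node`, `Graph`, `walk`, `build_toy_graph`, the `device` placement labels, and the toy string-transforming components are all hypothetical stand-ins for real encoders, backbones, and heads.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence, Tuple


# Hypothetical types for illustration only; none of this is M*'s API.
@dataclass
class Node:
    name: str
    fn: Callable[[Any], Any]   # the component's computation
    device: str = "gpu:0"      # assumed placement label, unused by the toy walk


@dataclass
class Graph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, node: Node, successors: Sequence[str] = ()) -> None:
        """Register a component and the components it may feed."""
        self.nodes[node.name] = node
        self.edges[node.name] = list(successors)

    def walk(self, path: Sequence[str], payload: Any) -> Tuple[Any, List[str]]:
        """Run payload through the nodes in path, checking every hop is an edge."""
        trace: List[str] = []
        for i, name in enumerate(path):
            if name not in self.nodes:
                raise ValueError(f"unknown node: {name}")
            if i > 0 and name not in self.edges[path[i - 1]]:
                raise ValueError(f"no edge {path[i - 1]} -> {name}")
            payload = self.nodes[name].fn(payload)
            trace.append(name)
        return payload, trace


def build_toy_graph() -> Graph:
    """One shared backbone feeding two output heads (toy string transforms)."""
    g = Graph()
    g.add(Node("text_encoder", lambda x: f"tokens({x})"), ["backbone"])
    g.add(Node("backbone", lambda x: f"hidden({x})"), ["image_head", "speech_head"])
    g.add(Node("image_head", lambda x: f"image({x})", device="gpu:1"))
    g.add(Node("speech_head", lambda x: f"audio({x})", device="gpu:2"))
    return g


if __name__ == "__main__":
    graph = build_toy_graph()
    # A text-to-image request and a text-to-speech request are two different
    # walks over the same graph; they share the encoder and backbone.
    print(graph.walk(["text_encoder", "backbone", "image_head"], "a cat"))
    print(graph.walk(["text_encoder", "backbone", "speech_head"], "hello"))
```

In this sketch, adding a new task means adding a node and naming a new path, while the shared components stay untouched. A real runtime would additionally schedule, batch, and place nodes across devices, which the toy `walk` does not attempt.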