Advances in multi-modal generative models are enabling new applications, from storytelling to automated media synthesis. Most current workloads generate simple outputs (e.g., image generation from a prompt) in batch mode, often requiring several seconds even for basic results. Serving real-time multi-modal workflows at scale is costly and complex, requiring efficient coordination of diverse models (each with unique resource needs) across language, audio, image, and video, all under strict latency and resource constraints. We tackle these challenges through the lens of real-time podcast video generation, integrating LLMs, text-to-speech, and video-audio generation. To meet tight SLOs, we design an adaptive, modular serving system, StreamWise, that dynamically manages quality (e.g., resolution, sharpness), model/content parallelism, and resource-aware scheduling. We leverage heterogeneous hardware to maximize responsiveness and efficiency; for example, the system can lower video resolution and allocate more resources to early scenes. We quantify the trade-offs between latency, cost, and quality. The cheapest setup generates a 10-minute podcast video on A100 GPUs in 1.4 hours (8.4x slower than real time) for less than \$25. StreamWise enables high-quality real-time streaming with a sub-second startup delay for under \$45.