Audio is indispensable to real-world video, yet generation models have largely overlooked it. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces distinct challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model that generates high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with 32B total parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase supports efficient inference, LoRA fine-tuning, and prompt enhancement.
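To make the parameter figures above concrete, the following is a minimal arithmetic sketch, not part of the MOVA codebase. It computes the share of parameters active per forward pass from the stated 32B total and 18B active counts. The weight-memory estimate additionally assumes 2 bytes per parameter (for example, a 16-bit format); the abstract does not state the storage precision, and the helper names are hypothetical.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Return the fraction of parameters used per forward pass."""
    if total_params_b <= 0 or not 0 <= active_params_b <= total_params_b:
        raise ValueError("expected 0 <= active <= total and total > 0")
    return active_params_b / total_params_b


def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Estimate weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    All experts must be stored, so this uses the total count,
    not the active count.
    """
    return total_params_b * 1e9 * bytes_per_param / 1e9


TOTAL_B = 32.0   # total parameters in billions, from the abstract
ACTIVE_B = 18.0  # active parameters in billions, from the abstract
ASSUMED_BYTES_PER_PARAM = 2.0  # assumption: 16-bit weights, not stated in the source

if __name__ == "__main__":
    print(f"active fraction: {active_fraction(TOTAL_B, ACTIVE_B):.2%}")
    print(f"weight memory: {weight_memory_gb(TOTAL_B, ASSUMED_BYTES_PER_PARAM):.0f} GB")
```

Under these assumptions, roughly 56% of the parameters participate in each forward pass, while storage still scales with the full 32B. Activations, caches, and any audio or video codecs are outside this estimate.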