MOVA: Towards Scalable and Synchronized Video-Audio Generation

SII-OpenMOSS Team: Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, Wenming Tu, Xiangyu Peng, Yang Gao, Yanru Huo, Ying Zhu, Yinze Luo, Yiyang Zhang, Yuerong Song, Zhe Xu, Zhiyu Zhang, Chenchen Yang, Cheng Chang, Chushu Zhou, Hanfu Chen, Hongnan Ma, Jiaxi Li, Jingqi Tong, Junxi Liu, Ke Chen, Shimin Li, Songlin Wang, Wei Jiang, Zhaoye Fei, Zhiyuan Ning, Chunguo Li, Chenhui Li, Ziwei He, Zengfeng Huang, Xie Chen, Xipeng Qiu

From arXiv. Technical report for MOVA (open-source video-audio generation model). 38 pages, 10 figures, 22 tables. Project page: https://mosi.cn/models/mova Code: https://github.com/OpenMOSS/MOVA Models: https://huggingface.co/collections/OpenMOSS-Team/mova Qinyuan Cheng and Tianyi Liang are project leaders. Xie Chen and Xipeng Qiu are corresponding authors.

Audio is indispensable for real-world video, yet video generation models have largely overlooked it. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with 32B total parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase provides comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
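
To make the stated MoE parameter counts concrete, the short sketch below works through the arithmetic. Only the 32B total and 18B active figures come from the abstract. The function names and the 2 bytes per parameter (a 16-bit weight format) are illustrative assumptions, not details of the released model, and the memory figure is a rough estimate of weight storage only.

```python
# Parameter counts stated in the abstract, in billions.
TOTAL_PARAMS_B = 32
ACTIVE_PARAMS_B = 18


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the total parameters that are active during inference."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active_b <= total_b and total_b > 0")
    return active_b / total_b


def weight_memory_gib(params_b: float, bytes_per_param: int = 2) -> float:
    """Rough weight storage in GiB.

    bytes_per_param=2 is an assumption (16-bit weights); the abstract
    does not state the precision of the released checkpoints.
    """
    return params_b * 1e9 * bytes_per_param / 2**30


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"active fraction: {frac:.4f}")  # 18 / 32 = 0.5625
    print(f"all weights (assumed 16-bit): {weight_memory_gib(TOTAL_PARAMS_B):.1f} GiB")
    print(f"active weights (assumed 16-bit): {weight_memory_gib(ACTIVE_PARAMS_B):.1f} GiB")
```

Under these assumptions, a little over half of the parameters (18/32) take part in inference compute, while storage scales with the full 32B unless some form of offloading is used.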
