We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework, MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data, learning to generate semantically aligned, high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves a new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
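To make the training objective mentioned above concrete, here is a minimal sketch of a conditional flow matching loss. This is the standard rectified-flow formulation, not MMAudio's actual code; the function name, tensor shapes, and the `model(xt, t, cond)` conditioning interface are illustrative assumptions.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Minimal conditional flow matching loss (sketch).

    Assumptions (not MMAudio's actual API):
      model: callable taking (noisy latents, time, condition) -> velocity
      x1:    clean audio latents, shape (B, T, D)
      cond:  multimodal conditioning features (e.g., video/text)
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)              # Gaussian noise endpoint
    t = torch.rand(b, device=x1.device)    # uniform time in [0, 1]
    t_ = t.view(b, 1, 1)                   # broadcast over (T, D)
    xt = (1.0 - t_) * x0 + t_ * x1         # linear interpolation path
    v_target = x1 - x0                     # constant target velocity
    v_pred = model(xt, t, cond)            # predicted velocity field
    return torch.mean((v_pred - v_target) ** 2)
```

At inference time, a model trained this way generates audio by integrating the learned velocity field from noise to data with an ODE solver over a small number of steps, which is consistent with the low per-clip generation latency the abstract reports.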