We propose the first joint audio-video generation framework that delivers engaging watching and listening experiences simultaneously, aiming at high-quality, realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (MM-Diffusion) with two coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process. Two subnets, one for audio and one for video, learn to gradually generate aligned audio-video pairs from Gaussian noise. To ensure semantic consistency across modalities, we propose a novel random-shift based attention block that bridges the two subnets, enabling efficient cross-modal alignment so that audio and video reinforce each other's fidelity. Extensive experiments show superior results in unconditional audio-video generation and in zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on the Landscape and AIST++ dancing datasets. Turing tests with 10k votes further demonstrate a dominant preference for our model. The code and pre-trained models can be downloaded at https://github.com/researchmm/MM-Diffusion.
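To illustrate the idea of a random-shift attention block bridging the two subnets, the following is a minimal NumPy sketch under stated assumptions, not the released implementation. It assumes the audio has been split into as many segments as there are video frames, that each segment or frame attends only to a short window of the other modality starting at a randomly drawn offset, and that the window wraps around at the clip boundary. The function name `random_shift_cross_attention`, the `window` and `shift` parameters, the residual update, and the absence of learned projections are all hypothetical simplifications chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def random_shift_cross_attention(video, audio, window, rng, shift=None):
    """Hypothetical sketch of windowed cross-modal attention with a random shift.

    video: (F, Nv, D) tokens per video frame.
    audio: (F, Na, D) tokens per audio segment (assumed aligned to frames).
    window: number of partner frames/segments each query attends to.
    shift: window offset; drawn uniformly from [0, F - window] when None.
    Returns (video_out, audio_out) with the same shapes as the inputs.
    """
    F, _, D = video.shape
    assert audio.shape[0] == F and audio.shape[2] == D
    assert 1 <= window <= F
    if shift is None:
        shift = int(rng.integers(0, F - window + 1))

    video_out = video.copy()
    audio_out = audio.copy()
    scale = 1.0 / np.sqrt(D)
    for i in range(F):
        # Partner indices for position i (wrap-around is an assumption).
        idx = (i + shift + np.arange(window)) % F

        # Audio segment i queries the video tokens inside the window.
        v_keys = video[idx].reshape(-1, D)
        attn = softmax(audio[i] @ v_keys.T * scale)
        audio_out[i] = audio[i] + attn @ v_keys

        # Video frame i queries the audio tokens inside the window.
        a_keys = audio[idx].reshape(-1, D)
        attn = softmax(video[i] @ a_keys.T * scale)
        video_out[i] = video[i] + attn @ a_keys
    return video_out, audio_out


# Usage with toy shapes: 4 frames, 3 video tokens and 5 audio tokens per step.
rng = np.random.default_rng(0)
video = rng.standard_normal((4, 3, 8))
audio = rng.standard_normal((4, 5, 8))
video_out, audio_out = random_shift_cross_attention(video, audio, window=2, rng=rng)
```

In this sketch each query attends to `window` frames or segments of the other modality rather than to the whole clip, which is one way the cross-modal cost could stay low; redrawing the shift on each call lets different temporal neighbourhoods interact over many steps.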