We introduce a multi-modal diffusion model for bi-directional conditional generation of video and audio. Because accurate alignment between video and audio events is essential in multi-modal generation tasks, we propose a joint contrastive training loss that improves the synchronization of visual and auditory events. We evaluate the model through comprehensive experiments on multiple datasets, assessing generation quality and alignment performance with both objective and subjective metrics. The proposed model outperforms the baseline, which supports its effectiveness and efficiency. Notably, the contrastive loss improves audio-visual alignment, particularly in the high-correlation video-to-audio generation task. These results indicate that the proposed model is a promising approach to improving the quality and alignment of multi-modal generation, contributing to the advancement of video and audio conditional generation systems.
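The abstract does not give the form of the joint contrastive training loss. To illustrate the general idea of pulling matching video and audio representations together while pushing mismatched pairs apart, the sketch below uses a generic symmetric InfoNCE objective over paired embeddings. This is a swapped-in, commonly used formulation and not necessarily the loss proposed here; the function name `symmetric_info_nce`, the embedding shapes, and the `temperature` value are all hypothetical choices made for illustration.

```python
import numpy as np


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(video_emb, audio_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative assumption, not the paper's loss).

    video_emb, audio_emb: arrays of shape (batch, dim), where row i of each
    array is assumed to come from the same clip (a positive pair). All other
    row combinations in the batch are treated as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix: entry (i, j) compares video i with audio j.
    logits = v @ a.T / temperature
    idx = np.arange(len(v))

    # Video-to-audio: each video should pick out its own audio among the batch.
    loss_v2a = -_log_softmax(logits, axis=1)[idx, idx].mean()
    # Audio-to-video: each audio should pick out its own video among the batch.
    loss_a2v = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2a + loss_a2v)
```

Under these assumptions, a batch whose video and audio rows are correctly paired yields a lower loss than the same batch with the audio rows shuffled, which is the property a training objective of this kind relies on to encourage alignment.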