In this study, we aim to construct an audio-video generative model at minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To this end, we propose a novel method that guides the single-modal models to cooperatively generate samples that are well aligned across modalities. Specifically, given two pre-trained base diffusion models, we train a lightweight joint guidance module that adjusts the scores estimated separately by the base models so that they match the score of the joint distribution over audio and video. We show theoretically that this guidance can be computed from the gradient of the optimal discriminator, which distinguishes real audio-video pairs from fake ones generated independently by the base models. Based on this analysis, we construct the joint guidance module by training this discriminator. Additionally, we adopt a loss function that makes the gradient of the discriminator act as a noise estimator, as in standard diffusion models, which stabilizes that gradient. Empirical evaluations on several benchmark datasets demonstrate that our method improves both single-modal fidelity and multi-modal alignment with a relatively small number of parameters.
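The link between the optimal discriminator and the joint score can be checked on a toy problem. The sketch below is not the paper's implementation: it assumes a two-dimensional Gaussian in which scalars `x` and `y` stand in for audio and video, uses a closed-form optimal discriminator instead of a trained network, and omits diffusion noise levels entirely. All names (`RHO`, `guidance_x`, and so on) are hypothetical. It illustrates only the density-ratio identity: when real pairs come from the joint distribution and fake pairs from the product of marginals with equal prior probability, the gradient of the logit of the optimal discriminator equals the joint score minus the marginal score.

```python
import math

RHO = 0.8  # assumed correlation between the two toy "modalities"


def log_joint(x, y, rho=RHO):
    """Log-density of a zero-mean, unit-variance bivariate Gaussian."""
    det = 1.0 - rho * rho
    quad = (x * x - 2.0 * rho * x * y + y * y) / (2.0 * det)
    return -math.log(2.0 * math.pi * math.sqrt(det)) - quad


def log_marginal(z):
    """Log-density of a standard normal (each marginal of the joint)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * z * z


def optimal_discriminator(x, y):
    """P(real | x, y) when fake pairs are drawn independently from the marginals."""
    p_real = math.exp(log_joint(x, y))
    p_fake = math.exp(log_marginal(x) + log_marginal(y))
    return p_real / (p_real + p_fake)


def logit(d):
    return math.log(d) - math.log1p(-d)


def guidance_x(x, y, h=1e-5):
    """Central finite difference of logit(D*) with respect to x."""
    upper = logit(optimal_discriminator(x + h, y))
    lower = logit(optimal_discriminator(x - h, y))
    return (upper - lower) / (2.0 * h)


def marginal_score_x(x):
    """Score of the single-modal (marginal) distribution of x."""
    return -x


def joint_score_x(x, y, rho=RHO):
    """Analytic x-component of the joint score, used as the reference."""
    return -(x - rho * y) / (1.0 - rho * rho)


x, y = 0.3, -1.2
corrected = marginal_score_x(x) + guidance_x(x, y)
print(corrected, joint_score_x(x, y))
```

In this toy setting the marginal score plus the discriminator-based guidance reproduces the joint score up to finite-difference error. In the setting described in the abstract, the discriminator is instead learned and its gradient is additionally trained to behave as a noise estimator.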