In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to jointly generate audio and video. To enhance alignment between generated audio-video pairs, we introduce two novel mechanisms into our model. The first is timestep adjustment, which provides different timestep information to each base model; it is designed to align how samples are generated across timesteps between the two modalities. The second is a new design for the additional modules, termed Cross-Modal Conditioning as Positional Encoding (CMC-PE). In CMC-PE, cross-modal information is embedded as if it represented temporal position information, and the embeddings are fed into the model in the same way as positional encoding. Compared with the popular cross-attention mechanism, CMC-PE provides a better inductive bias for temporal alignment in the generated data. Experimental results validate the effectiveness of the two newly introduced mechanisms and demonstrate that our method outperforms existing methods.
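The core idea of CMC-PE can be illustrated with a minimal sketch: features from one modality are resampled to the other modality's temporal length and added frame-wise, like a positional encoding, rather than mixed via cross-attention. The function name, nearest-neighbor resampling, and purely additive conditioning below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def cmc_pe(target_feats: np.ndarray, source_feats: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of Cross-Modal Conditioning as Positional
    Encoding (CMC-PE).

    target_feats: (t_tgt, d) features of the modality being generated.
    source_feats: (t_src, d) features from the other modality.

    The source timeline is resampled to the target's length so that each
    target frame receives the cross-modal embedding of the corresponding
    moment in time, then the embeddings are added position-wise, the same
    way a positional encoding would be.
    """
    t_tgt, d = target_feats.shape
    t_src, d_src = source_feats.shape
    assert d == d_src, "both modalities must share the feature dimension"
    # Nearest-neighbor resampling of the source timeline to the target's
    # (a real model might use learned interpolation instead; this is a sketch).
    idx = np.round(np.linspace(0, t_src - 1, t_tgt)).astype(int)
    aligned = source_feats[idx]       # (t_tgt, d)
    return target_feats + aligned     # additive, position-wise conditioning
```

Because the conditioning is injected at matching temporal positions, a frame at time t is directly biased by the other modality's content at time t, which is the inductive bias for temporal alignment that cross-attention lacks.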