We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self-attention operation. Unlike prior work that relies on feature extractors pretrained for other tasks to produce the conditioning signal, AV-Link directly leverages features obtained from the complementary modality within a single framework, i.e., video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: snap-research.github.io/AVLink/
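The bidirectional exchange in the Fusion Block can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the block concatenates video and audio tokens along the time axis and runs single-head joint self-attention with a residual connection, and it omits temporal alignment details (e.g., aligned positional embeddings) and multi-head structure. All names (`fusion_block`, `Wq`, `Wk`, `Wv`) are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion_block(video_tokens, audio_tokens, Wq, Wk, Wv):
    """Hypothetical joint self-attention over concatenated modalities.

    Because video and audio tokens attend within one shared sequence,
    each modality can draw on the other's features (bidirectional
    exchange); the residual connection preserves the frozen backbones'
    original activations.
    """
    Tv = video_tokens.shape[0]
    x = np.concatenate([video_tokens, audio_tokens], axis=0)  # (Tv+Ta, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    fused = x + attn @ v                                      # residual add
    return fused[:Tv], fused[Tv:]                             # split back

# Toy usage: 4 video tokens and 6 audio tokens, feature dim 8.
rng = np.random.default_rng(0)
d = 8
v_tok = rng.standard_normal((4, d))
a_tok = rng.standard_normal((6, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
v_out, a_out = fusion_block(v_tok, a_tok, Wq, Wk, Wv)
```

Each modality's token count and feature dimension are preserved, so the fused features can be fed back into the respective diffusion backbone in place of the originals.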