We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models through temporally-aligned self-attention operations. Unlike prior work that uses dedicated models for V2A and A2V tasks and relies on pretrained feature extractors, AV-Link achieves both tasks in a single self-contained framework, directly leveraging features obtained from the complementary modality (i.e., video features to generate audio, or audio features to generate video). Extensive automatic and subjective evaluations demonstrate that our method achieves a substantial improvement in audio-video synchronization, outperforming more expensive baselines such as the MovieGen video-to-audio model.
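To make the Fusion Block idea concrete, below is a minimal, hypothetical numpy sketch of joint self-attention over concatenated video and audio tokens: tokens from both streams attend to each other in one attention operation, then are split back to their respective diffusion models. All function names, the single-head formulation, and the omission of temporal positional alignment and projections back into each model are simplifying assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fusion_block(video_tokens, audio_tokens, Wq, Wk, Wv):
    """Hypothetical sketch: single-head joint self-attention over the
    concatenation of video and audio tokens, so each modality's tokens
    can attend to the other's. Shapes: (T_v, d) and (T_a, d)."""
    x = np.concatenate([video_tokens, audio_tokens], axis=0)  # (T_v + T_a, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))            # rows sum to 1
    out = attn @ v
    t_v = video_tokens.shape[0]
    # Split fused tokens back into per-modality streams.
    return out[:t_v], out[t_v:]
```

In the real system the token sequences would carry temporal position information so that attention is temporally aligned, and the fused features would be injected back into the frozen diffusion backbones; this sketch only illustrates the bidirectional exchange.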