We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos must be aligned with the input audio both globally and temporally: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We build on an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it enables video generation conditioned on text, on audio, and, to the best of our knowledge for the first time, on both text and audio. We validate our method extensively on three datasets that exhibit significant semantic diversity of audio-video samples, and we further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, with respect to both content and the temporal axis. We also show that videos produced by our method exhibit higher visual quality and are more diverse.
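To make the idea behind AV-Align more concrete, the following is a minimal sketch of a peak-based alignment score. The abstract states only that the metric detects and compares energy peaks in both modalities; everything else here is an assumption made for illustration, including the function names `find_peaks` and `peak_alignment_score`, the use of one energy value per video frame, the simple local-maximum peak detector, the greedy matching with a `tolerance` in frames, and the intersection-over-union form of the score. It should not be read as the metric's actual definition.

```python
from typing import List, Sequence


def find_peaks(energy: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Return indices of samples that exceed `threshold` and are strictly
    greater than both neighbours (assumed peak definition, for illustration)."""
    peaks = []
    for i in range(1, len(energy) - 1):
        if (energy[i] > threshold
                and energy[i] > energy[i - 1]
                and energy[i] > energy[i + 1]):
            peaks.append(i)
    return peaks


def peak_alignment_score(audio_peaks: Sequence[int],
                         video_peaks: Sequence[int],
                         tolerance: int = 1) -> float:
    """IoU-style score between two sets of peak positions (frame indices).

    A video peak matches an audio peak if they are at most `tolerance`
    frames apart; each peak can be matched at most once.
    """
    if not audio_peaks and not video_peaks:
        return 0.0  # convention chosen here: nothing to align

    unmatched_video = sorted(video_peaks)
    matched = 0
    for a in sorted(audio_peaks):
        # Candidate video peaks that are close enough to this audio peak.
        candidates = [v for v in unmatched_video if abs(v - a) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda v: abs(v - a))
            unmatched_video.remove(best)
            matched += 1

    # Matched pairs are counted once in the union.
    union = len(audio_peaks) + len(video_peaks) - matched
    return matched / union


# Hypothetical per-frame energies (e.g. audio loudness and visual motion).
audio_energy = [0, 1, 0, 0, 3, 0, 0, 2, 0]
video_energy = [0, 0, 2, 0, 1, 0, 0, 0, 0]

score = peak_alignment_score(find_peaks(audio_energy),
                             find_peaks(video_energy),
                             tolerance=1)
```

In this toy input the audio has peaks at frames 1, 4, and 7 and the video at frames 2 and 4, so two peaks match within one frame and the score is 2/3. In practice the per-frame energies would have to be derived from the raw audio and video first, a step this sketch leaves out.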