Recent advances in audio generation have been driven by large-scale deep learning models and expansive datasets. Video-to-audio (V2A) generation nonetheless remains challenging, principally because of the intricate relationship between high-dimensional visual and auditory data and the difficulty of temporal synchronization. In this study, we introduce FoleyGen, an open-domain V2A generation system built on a language modeling paradigm. FoleyGen uses an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens. A single Transformer model generates the audio tokens, conditioned on visual features extracted from a visual encoder. A prevalent problem in V2A generation is the misalignment of generated audio with the visible actions in the video. To address this, we explore three novel visual attention mechanisms. We further undertake an exhaustive evaluation of multiple visual encoders, each pretrained on either single-modal or multi-modal tasks. Experimental results on the VGGSound dataset show that FoleyGen outperforms previous systems across all objective metrics and human evaluations.
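As an illustration of how a single Transformer might be conditioned on visual features, the sketch below builds an attention mask for a sequence made of a visual-feature prefix followed by audio tokens. This is a generic prefix-conditioned causal mask and is not claimed to be one of the three attention mechanisms explored in this work; the function name `prefix_causal_mask` and its parameters are hypothetical.

```python
def prefix_causal_mask(num_visual: int, num_audio: int) -> list[list[bool]]:
    """Return mask[i][j], True when position i may attend to position j.

    Assumed layout (illustrative only): positions [0, num_visual) hold
    visual features, positions [num_visual, total) hold audio tokens.
    """
    total = num_visual + num_audio
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_visual:
                # Every position may attend to the whole visual prefix.
                mask[i][j] = True
            elif i >= num_visual and j <= i:
                # Audio tokens attend causally to themselves and earlier audio.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print the mask for 2 visual features and 3 audio tokens.
    for row in prefix_causal_mask(2, 3):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this layout, each audio token sees every visual feature but only the audio tokens generated so far, which is what autoregressive decoding requires. Restricting which visual features an audio token may see (for example, only frames near its timestamp) would be one way to change the mask, again as an assumption rather than the method described here.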