Foley is a term commonly used in filmmaking, referring to the addition of everyday sound effects to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A), as a particular type of automatic foley task, presents inherent challenges related to audio-visual synchronization. These challenges include keeping the content of the generated audio consistent with the input video and aligning its temporal and loudness properties with the video. To address these issues, we construct a controllable video-to-audio synthesis model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals. To ensure content consistency between the synthesized audio and the target video, we introduce the Mask-Attention Module (MAM), which employs masked video instructions to enable the model to focus on regions of interest. Additionally, we implement the Time-Loudness Module (TLM), which uses an auxiliary loudness signal to ensure that the synthesized sound aligns with the video in both loudness and temporal dimensions. Furthermore, we extend a large-scale V2A dataset, named VGGSound-Caption, by annotating caption prompts. Extensive experiments on challenging benchmarks across two large-scale V2A datasets verify that Draw an Audio achieves state-of-the-art performance. Project page: https://yannqi.github.io/Draw-an-Audio/.
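To make the two input instructions concrete, the following is a minimal sketch of what a drawn mask and a loudness signal could look like as arrays. It is not the implementation of MAM or TLM: the abstract does not specify how either signal is computed, so the frame-wise RMS envelope, the binary mask multiplication, and the names `loudness_envelope`, `apply_mask`, `frame_len`, and `hop` are all assumptions chosen for illustration.

```python
import numpy as np


def loudness_envelope(wave, frame_len=1024, hop=512, eps=1e-8):
    """Frame-wise RMS level in dB (assumed stand-in for a loudness signal)."""
    wave = np.asarray(wave, dtype=np.float64)
    # Number of full frames; fall back to a single (shorter) frame for short clips.
    n_frames = 1 + max(0, (len(wave) - frame_len) // hop)
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = wave[i * hop : i * hop + frame_len]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    # eps keeps silent frames finite instead of -inf.
    return 20.0 * np.log10(rms + eps)


def apply_mask(frames, mask):
    """Zero out pixels outside a drawn region (assumed form of a masked video).

    frames: array of shape (T, H, W, C)
    mask:   binary array of shape (H, W), 1 inside the region of interest
    """
    frames = np.asarray(frames)
    mask = np.asarray(mask, dtype=frames.dtype)
    # Broadcast the mask over the time and channel axes.
    return frames * mask[None, :, :, None]


# Hypothetical usage: a clip that is silent, then loud, and a 2x2 drawn region.
wave = np.concatenate([np.zeros(2048), np.ones(2048)])
env = loudness_envelope(wave)

frames = np.ones((2, 4, 4, 3))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1
masked = apply_mask(frames, mask)
```

In this sketch the envelope has one value per hop, so it carries both when a sound occurs and how loud it is, which is the kind of information the abstract attributes to the auxiliary loudness signal. The mask only restricts which pixels remain visible; how the model attends to them is left to the paper.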