Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Foley is a filmmaking term for adding everyday sound effects to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A), a particular type of automatic foley task, presents inherent challenges related to audio-visual synchronization. These challenges include maintaining content consistency between the input video and the generated audio, as well as aligning the temporal and loudness properties of the audio with the video. To address these issues, we construct a controllable video-to-audio synthesis model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals. To ensure content consistency between the synthesized audio and the target video, we introduce the Mask-Attention Module (MAM), which employs a masked video instruction to enable the model to focus on regions of interest. Additionally, we implement the Time-Loudness Module (TLM), which uses an auxiliary loudness signal to ensure that the synthesized sound aligns with the video in both loudness and temporal dimensions. Furthermore, we extend a large-scale V2A dataset, named VGGSound-Caption, by annotating caption prompts. Extensive experiments on challenging benchmarks across two large-scale V2A datasets verify that Draw an Audio achieves state-of-the-art performance. Project page: https://yannqi.github.io/Draw-an-Audio/.
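As a rough illustration of the two instruction types named above, the following sketch computes a frame-wise loudness signal from a waveform and applies a drawn binary mask to a video frame. It is not the paper's implementation: the abstract does not specify how loudness is measured or how masks are fed to the model, so the RMS-in-decibels definition, the function names `rms_loudness` and `apply_mask`, and the parameters `frame_len`, `hop`, and `eps` are all assumptions made for illustration.

```python
import numpy as np


def rms_loudness(wave, frame_len=1024, hop=512, eps=1e-8):
    """Frame-wise RMS loudness in dB for a mono waveform.

    Hypothetical helper: RMS in dB is one common loudness proxy,
    assumed here because the abstract does not define the signal.
    """
    wave = np.asarray(wave, dtype=np.float64)
    # Pad very short inputs so that at least one full frame exists.
    if wave.size < frame_len:
        wave = np.pad(wave, (0, frame_len - wave.size))
    n_frames = 1 + (wave.size - frame_len) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # eps avoids log10(0) on silent frames.
    return 20.0 * np.log10(rms + eps)


def apply_mask(frame, mask):
    """Keep only the drawn region of a frame.

    frame: array of shape (H, W, C); mask: array of shape (H, W) with 0/1 values.
    Hypothetical helper: pixels outside the mask are set to zero.
    """
    return frame * mask[..., None]
```

A signal like the output of `rms_loudness` could serve as the auxiliary time-loudness condition, and a frame processed by `apply_mask` could stand in for the masked video instruction; how the actual model consumes either input is described in the paper, not here.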

