Sound designers and Foley artists usually sonorize a scene, such as one from a movie or video game, by manually annotating and sonorizing each action of interest in the video. In our case, the intent is to leave full creative control to sound designers with a tool that allows them to bypass the more repetitive parts of their work, so they can focus on the creative aspects of sound production. We achieve this by presenting Stable-V2A, a two-stage model consisting of: an RMS-Mapper that estimates an envelope representative of the audio characteristics associated with the input video; and Stable-Foley, a diffusion model based on Stable Audio Open that generates audio semantically and temporally aligned with the target video. Temporal alignment is guaranteed by using the envelope as a ControlNet input, while semantic alignment is achieved through sound representations chosen by the designer as cross-attention conditioning of the diffusion process. We train and test our model on Greatest Hits, a dataset commonly used to evaluate V2A models. In addition, to test our model on a case study of interest, we introduce Walking The Maps, a dataset of videos extracted from video games depicting animated characters walking in different locations. Samples and code are available on our demo page at https://ispamm.github.io/Stable-V2A.
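The temporal control signal described above is a frame-wise RMS envelope of the target audio. The abstract does not specify the exact computation or frame parameters, but a minimal sketch of such an envelope (with hypothetical frame and hop sizes) could look like:

```python
import numpy as np

def rms_envelope(audio, frame_len=1024, hop=512):
    """Frame-wise RMS envelope of a mono waveform.

    frame_len and hop are illustrative values, not the paper's settings.
    """
    n_frames = 1 + max(0, len(audio) - frame_len) // hop
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop : i * hop + frame_len]
        env[i] = np.sqrt(np.mean(frame ** 2))
    return env

# Example: a 440 Hz tone with a linear fade-in, so the envelope rises.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t) * np.linspace(0.0, 1.0, sr)
env = rms_envelope(audio)
```

In the model, an envelope of this kind (predicted from video by the RMS-Mapper rather than computed from ground-truth audio at inference time) conditions the diffusion process via ControlNet to enforce temporal alignment.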