Sound designers and Foley artists usually sonorize a scene, such as one from a movie or video game, by manually annotating and sonorizing each action of interest in the video. In our case, the intent is to leave full creative control to sound designers with a tool that allows them to bypass the more repetitive parts of their work, so they can focus on the creative aspects of sound production. We achieve this by presenting Stable-V2A, a two-stage model consisting of: an RMS-Mapper that estimates an envelope representative of the audio characteristics associated with the input video; and Stable-Foley, a diffusion model based on Stable Audio Open that generates audio semantically and temporally aligned with the target video. Temporal alignment is guaranteed by using the envelope as a ControlNet input, while semantic alignment is achieved through sound representations chosen by the designer as cross-attention conditioning of the diffusion process. We train and test our model on Greatest Hits, a dataset commonly used to evaluate V2A models. In addition, to test our model on a case study of interest, we introduce Walking The Maps, a dataset of videos extracted from video games depicting animated characters walking in different locations. Samples and code are available on our demo page at https://ispamm.github.io/Stable-V2A.
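The temporal control signal described above is a frame-wise RMS envelope of the target audio. The abstract does not specify the exact computation or frame parameters, but a minimal sketch of such an envelope (with hypothetical frame and hop sizes) could look like:

```python
import numpy as np

def rms_envelope(audio, frame_len=1024, hop=512):
    """Frame-wise RMS envelope of a mono waveform.

    frame_len and hop are illustrative values, not the paper's settings.
    """
    n_frames = 1 + max(0, len(audio) - frame_len) // hop
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop : i * hop + frame_len]
        env[i] = np.sqrt(np.mean(frame ** 2))
    return env

# Example: a 440 Hz tone with a linear fade-in, so the envelope rises.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t) * np.linspace(0.0, 1.0, sr)
env = rms_envelope(audio)
```

In the model, an envelope of this kind (predicted from video by the RMS-Mapper rather than computed from ground-truth audio at inference time) conditions the diffusion process via ControlNet to enforce temporal alignment.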