Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos

Modeling sounds emitted from physical object interactions is critical for immersive perceptual experiences in real and virtual worlds. Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that can represent and synthesize the sound. However, these methods require fine details of both the object geometries and the impact locations, which are rarely available in the real world, so they cannot be applied to synthesize impact sounds from common videos. On the other hand, existing video-driven deep learning approaches capture only a weak correspondence between visual content and impact sounds because they lack physics knowledge. In this work, we propose a physics-driven diffusion model that can synthesize high-fidelity impact sounds for a silent video clip. In addition to the video content, we propose to use physics priors to guide the impact sound synthesis procedure. The physics priors include both physics parameters that are directly estimated from noisy real-world impact sound examples without a sophisticated setup and learned residual parameters that interpret the sound environment via neural networks. We further implement a novel diffusion model with specific training and inference strategies to combine physics priors and visual information for impact sound synthesis. Experimental results show that our model outperforms several existing systems in generating realistic impact sounds. More importantly, the physics-based representations are fully interpretable and transparent, which enables flexible sound editing.
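The abstract does not spell out which physics parameters are used. As an illustration only, the sketch below assumes a standard modal representation, in which an impact sound is a sum of exponentially decaying sinusoids, each described by a frequency, an amplitude, and a decay rate. The function name `synthesize_impact` and the mode values are hypothetical, and this is not the paper's diffusion model. It only shows why such parameters are interpretable and easy to edit.

```python
import math


def synthesize_impact(modes, sample_rate=16000, duration=0.5):
    """Sum exponentially decaying sinusoids, one per mode.

    modes: list of (frequency_hz, amplitude, decay_per_second) tuples.
    This three-parameter modal form is an assumption made for
    illustration, not necessarily the parameterization in the paper.
    """
    num_samples = int(sample_rate * duration)
    signal = []
    for i in range(num_samples):
        t = i / sample_rate
        sample = sum(
            amp * math.exp(-decay * t) * math.sin(2.0 * math.pi * freq * t)
            for freq, amp, decay in modes
        )
        signal.append(sample)
    return signal


# Hypothetical modes for a small struck object (values are made up).
modes = [(440.0, 1.0, 8.0), (1230.0, 0.5, 20.0), (2750.0, 0.25, 45.0)]
sound = synthesize_impact(modes)

# "Editing" the sound: triple every decay rate so it rings for less time.
damped_modes = [(freq, amp, decay * 3.0) for freq, amp, decay in modes]
damped_sound = synthesize_impact(damped_modes)
```

Because each parameter has a direct physical meaning, changing one value (here, the decay rate) alters the sound in a predictable way, which is the kind of editing the abstract refers to.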

