Modeling sounds emitted from physical object interactions is critical for immersive perceptual experiences in real and virtual worlds. Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that can represent and synthesize the sound. However, they require fine details of both the object geometries and impact locations, which are rarely available in the real world, so they cannot be applied to synthesize impact sounds from common videos. On the other hand, existing video-driven deep learning approaches capture only a weak correspondence between visual content and impact sounds because they lack physics knowledge. In this work, we propose a physics-driven diffusion model that can synthesize high-fidelity impact sounds for a silent video clip. In addition to the video content, we use physics priors to guide the impact sound synthesis procedure. The physics priors include both physics parameters that are directly estimated from noisy real-world impact sound examples without a sophisticated setup and learned residual parameters that interpret the sound environment via neural networks. We further implement a novel diffusion model with specific training and inference strategies to combine physics priors and visual information for impact sound synthesis. Experimental results show that our model outperforms several existing systems in generating realistic impact sounds. More importantly, the physics-based representations are fully interpretable and transparent, thus enabling us to perform sound editing flexibly.
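To make the idea of "physics parameters that can represent and synthesize the sound" concrete, the following is a minimal sketch, assuming a modal representation in which an impact sound is a sum of exponentially decaying sinusoids. The abstract does not specify the parameterization, so the (frequency, amplitude, decay) triples, the function name `synthesize_impact`, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def synthesize_impact(modes, sample_rate=16000, duration=0.5):
    """Render an impact sound as a sum of damped sinusoids.

    modes: iterable of (frequency_hz, amplitude, decay_per_second) triples.
    This parameterization is an assumption for illustration, not the
    representation confirmed by the abstract.
    """
    num_samples = int(sample_rate * duration)
    signal = [0.0] * num_samples
    for frequency, amplitude, decay in modes:
        for i in range(num_samples):
            t = i / sample_rate
            envelope = amplitude * math.exp(-decay * t)
            signal[i] += envelope * math.sin(2.0 * math.pi * frequency * t)
    return signal


def energy(samples):
    """Sum of squared sample values."""
    return sum(s * s for s in samples)


# Hypothetical mode values, not taken from the paper.
modes = [(440.0, 1.0, 8.0), (1210.0, 0.5, 20.0), (2750.0, 0.25, 45.0)]
signal = synthesize_impact(modes)

# "Editing" the sound by changing an interpretable parameter:
# doubling every decay rate makes the impact ring for a shorter time.
damped_modes = [(f, a, 2.0 * d) for f, a, d in modes]
damped_signal = synthesize_impact(damped_modes)
```

Because each parameter in such a representation has a direct acoustic meaning, changing one of them (here, the decay rates) alters the sound in a predictable way, which is the kind of flexible editing the abstract refers to.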