Modeling the sounds emitted by physical object interactions is critical for immersive perceptual experiences in real and virtual worlds. Traditional impact sound synthesis methods use physics simulation to obtain a set of physics parameters that can represent and synthesize the sound. However, they require fine details of both object geometry and impact location, which are rarely available in the real world, so they cannot be applied to synthesize impact sounds from common videos. On the other hand, existing video-driven deep learning approaches capture only a weak correspondence between visual content and impact sounds because they lack physics knowledge. In this work, we propose a physics-driven diffusion model that can synthesize high-fidelity impact sound for a silent video clip. In addition to the video content, we use physics priors to guide the impact sound synthesis procedure. The physics priors include both physics parameters that are estimated directly from noisy real-world impact sound examples without a sophisticated setup and learned residual parameters that interpret the sound environment via neural networks. We further implement a novel diffusion model with specific training and inference strategies to combine physics priors and visual information for impact sound synthesis. Experimental results show that our model outperforms several existing systems in generating realistic impact sounds. More importantly, the physics-based representations are fully interpretable and transparent, which enables flexible sound editing.
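To make the idea of interpretable physics parameters concrete, the following is a minimal sketch of classical modal synthesis, in which an impact sound is approximated as a sum of exponentially damped sinusoids. It is not the model proposed here: the abstract does not specify the parameterization, so the per-mode frequency, amplitude, and decay rate, along with the function name `synthesize_impact` and its default arguments, are assumptions chosen for illustration only.

```python
import math


def synthesize_impact(freqs_hz, amps, decays, sample_rate=16000, duration_s=0.5):
    """Return a list of samples built from damped sinusoidal modes.

    Assumed (hypothetical) parameterization: each mode i contributes
    amps[i] * exp(-decays[i] * t) * sin(2 * pi * freqs_hz[i] * t).
    """
    if not (len(freqs_hz) == len(amps) == len(decays)):
        raise ValueError("freqs_hz, amps, and decays must have the same length")

    num_samples = int(sample_rate * duration_s)
    signal = [0.0] * num_samples

    # Accumulate one damped sinusoid per mode.
    for freq, amp, decay in zip(freqs_hz, amps, decays):
        for k in range(num_samples):
            t = k / sample_rate
            signal[k] += amp * math.exp(-decay * t) * math.sin(2.0 * math.pi * freq * t)
    return signal


# Two illustrative modes; the numbers are made up, not measured.
ringing = synthesize_impact([440.0, 1200.0], [1.0, 0.5], [8.0, 20.0])

# "Editing" the sound: larger decay rates give a shorter, duller impact.
muted = synthesize_impact([440.0, 1200.0], [1.0, 0.5], [80.0, 200.0])
```

Because every parameter in such a representation has a direct acoustic meaning, changing a decay rate or shifting a mode frequency alters the sound in a predictable way, which is the kind of editing that an interpretable physics-based representation makes possible.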