We propose a method for adding sound-guided visual effects to specific regions of videos in a zero-shot setting. Animating a visual effect is challenging because every frame of the edited video must change visually while the sequence remains temporally consistent. Moreover, existing video editing solutions focus on temporal consistency across frames and ignore visual style variations over time, e.g., thunderstorms, waves, and crackling fire. To overcome this limitation, we use temporal sound features to drive the dynamic style. Specifically, we guide denoising diffusion probabilistic models with an audio latent representation in the audio-visual latent space. To the best of our knowledge, our work is the first to explore sound-guided natural video editing from various sound sources with sound-specialized properties, such as intensity, timbre, and volume. Additionally, we design optical flow-based guidance to generate temporally consistent video frames, capturing the pixel-wise relationship between adjacent frames. Experimental results show that our method outperforms existing video editing techniques, producing more realistic visual effects that reflect the properties of sound. Please visit our page: https://kuai-lab.github.io/soundini-gallery/.