Vision-to-audio (V2A) synthesis has broad applications in multimedia. Recent advances in V2A methods have made it possible to generate relevant audio from video or still-image inputs. However, the immersiveness and expressiveness of the generated audio remain limited. One likely cause is that existing methods rely solely on the global scene and overlook the details of local sounding objects (i.e., sound sources). To address this issue, we propose a Sound Source-Aware V2A (SSV2A) generator. SSV2A locally perceives multimodal sound sources in a scene through visual detection and cross-modality translation. It then contrastively learns a Cross-Modal Sound Source (CMSS) Manifold to semantically disambiguate each source. Finally, we attentively mix their CMSS semantics into a rich audio representation, from which a pretrained audio generator outputs the sound. To model the CMSS manifold, we curate VGGS3, a novel single-sound-source visual-audio dataset derived from VGGSound. We also design a Sound Source Matching Score to measure localized audio relevance. By addressing V2A generation at the sound-source level, SSV2A surpasses state-of-the-art methods in both generation fidelity and relevance, as extensive experiments demonstrate. We further demonstrate SSV2A's ability to achieve intuitive V2A control by compositing vision, text, and audio conditions. Our generated audio can be tried and heard at https://ssv2a.github.io/SSV2A-demo.
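To make the source-level pipeline concrete, below is a minimal PyTorch sketch of its core flow: per-source features (e.g., from a visual detector) are projected onto the CMSS manifold, then attentively mixed into a single conditioning vector for a pretrained audio generator. The class names (`CMSSEncoder`, `SourceMixer`), dimensions, and the single-query attention mixer are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the SSV2A source-level flow, under assumed
# module names and dimensions. The real architecture may differ.
import torch
import torch.nn as nn

class CMSSEncoder(nn.Module):
    """Projects a per-source feature (from a visual crop, text, or
    audio embedding) onto the shared CMSS manifold."""
    def __init__(self, in_dim: int, manifold_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, manifold_dim),
            nn.GELU(),
            nn.Linear(manifold_dim, manifold_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so sources lie on a unit hypersphere, as is
        # common for contrastively trained embeddings.
        return nn.functional.normalize(self.proj(x), dim=-1)

class SourceMixer(nn.Module):
    """Attentively mixes per-source CMSS embeddings into one
    audio-generation condition vector (assumed design)."""
    def __init__(self, manifold_dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(manifold_dim, n_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, manifold_dim))

    def forward(self, sources: torch.Tensor) -> torch.Tensor:
        # sources: (batch, n_sources, manifold_dim)
        q = self.query.expand(sources.size(0), -1, -1)
        mixed, _ = self.attn(q, sources, sources)
        return mixed.squeeze(1)  # (batch, manifold_dim)

if __name__ == "__main__":
    # Toy example: 3 detected sound sources with 768-d visual features.
    feats = torch.randn(1, 3, 768)
    cmss = CMSSEncoder(in_dim=768)(feats)   # per-source manifold embeddings
    condition = SourceMixer()(cmss)          # mixed audio representation
    print(condition.shape)  # torch.Size([1, 512]); this vector would then
    # condition a pretrained audio generator to synthesize the sound.
```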