Acoustic matching aims to re-synthesize an audio clip so that it sounds as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this either limits the diversity of training data or requires simulated data or heuristics to create paired samples. We propose a self-supervised approach to visual acoustic matching in which training samples include only the target scene image and audio, without acoustically mismatched source audio for reference. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio. Training with either in-the-wild web data or simulated data, we demonstrate that it outperforms the state of the art on multiple challenging datasets and a wide variety of real-world audio and environments.