The framework of visually guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor sophisticated visual feature extractors for informative visual guidance and to devise separate modules for feature fusion, while using U-Net by default for sound analysis. However, such a divide-and-conquer paradigm is parameter-inefficient and may yield suboptimal performance, because jointly optimizing and harmonizing the various model components is challenging. By contrast, this paper presents a novel approach, dubbed audio-visual predictive coding (AVPC), to tackle this task in a more parameter-efficient and more effective manner. The AVPC network features a simple ResNet-based video analysis network for deriving semantic visual features, and a predictive coding-based sound separation network that extracts audio features, fuses multimodal information, and predicts sound separation masks within the same architecture. By iteratively minimizing the prediction error between features, AVPC integrates audio and visual information recursively, leading to progressively improved performance. In addition, we develop an effective self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source. Extensive evaluations demonstrate that AVPC outperforms several baselines in separating musical instrument sounds, while reducing the model size significantly. Code is available at: https://github.com/zjsong/Audio-Visual-Predictive-Coding.
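To make the idea of iteratively minimizing a prediction error concrete, the following is a minimal toy sketch of a generic predictive-coding update. It is not the AVPC network. It assumes a single linear layer, and the names `predictive_coding_steps`, `audio_feat`, `visual_feat`, and `weights`, the feature dimensions, and the random data are all hypothetical choices made for illustration. A representation initialized from a visual feature is refined step by step so that its prediction matches an audio feature.

```python
import numpy as np


def predictive_coding_steps(audio_feat, visual_feat, weights, step_size, num_steps=5):
    """Toy predictive-coding loop (illustrative only, not the AVPC architecture).

    The representation starts from the visual feature and is refined so that
    its linear prediction `weights @ rep` moves toward the audio feature.
    Returns the final representation and the squared prediction error
    measured at the start of each step.
    """
    rep = visual_feat.copy()
    errors = []
    for _ in range(num_steps):
        # Prediction error between the target feature and the current prediction.
        pred_error = audio_feat - weights @ rep
        errors.append(float(pred_error @ pred_error))
        # Gradient step on 0.5 * ||pred_error||^2 with respect to `rep`.
        rep = rep + step_size * (weights.T @ pred_error)
    return rep, errors


# Hypothetical toy data: an 8-dim "audio" feature and a 4-dim "visual" feature.
rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 4))
audio_feat = rng.standard_normal(8)
visual_feat = rng.standard_normal(4)

# A step size of 1 / (largest singular value)^2 keeps the error from increasing.
step_size = 1.0 / np.linalg.norm(weights, 2) ** 2
rep, errors = predictive_coding_steps(audio_feat, visual_feat, weights, step_size)
```

In this sketch the squared error recorded in `errors` does not increase from one step to the next, which mirrors, in a much simplified form, the progressive refinement described above.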