The ability to accurately recognize, localize, and separate sound sources is fundamental to any audio-visual perception task. Historically, these abilities have been tackled separately, with methods developed independently for each task. However, because source localization, separation, and recognition are interconnected, independent models are likely to yield suboptimal performance, as they fail to capture the interdependence between these tasks. To address this problem, we propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition. OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives. The first objective aligns audio and visual representations through a localized audio-visual correspondence loss. The second tackles visual source separation using a traditional mix-and-separate framework. Finally, the third objective reinforces visual feature separation and localization by mixing images in pixel space and aligning their representations with those of all corresponding sound sources. Extensive experiments on the MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM on all three tasks (audio-visual source localization, separation, and nearest-neighbor recognition) and empirically show strong positive transfer between them.
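To make the three objectives more concrete, the following is a minimal NumPy sketch of the data-side operations they rely on. It is not the OneAVM implementation: the function names (`localization_map`, `mix_and_separate_targets`, `mix_images`), the use of cosine similarity, the ratio-mask targets, and the approximation that magnitude spectrograms add linearly are all assumptions made for illustration.

```python
import numpy as np


def localization_map(audio_emb, visual_feats, eps=1e-8):
    """Cosine similarity between one audio embedding (D,) and a grid of
    visual features (H, W, D). Hypothetical stand-in for the similarity
    map that a localized audio-visual correspondence loss would score."""
    a = audio_emb / (np.linalg.norm(audio_emb) + eps)
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    return v @ a  # shape (H, W)


def mix_and_separate_targets(spec_a, spec_b, eps=1e-8):
    """Mix two non-negative magnitude spectrograms and return the mixture
    together with ratio-mask targets for each source.
    Assumption: magnitudes add linearly, which is only approximately true."""
    mixture = spec_a + spec_b
    mask_a = spec_a / (mixture + eps)
    mask_b = spec_b / (mixture + eps)
    return mixture, mask_a, mask_b


def mix_images(img_a, img_b, alpha=0.5):
    """Blend two images in pixel space (assumed convex combination)."""
    return alpha * img_a + (1.0 - alpha) * img_b


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Localization: an audio embedding that matches one grid cell
    # should peak at that cell of the similarity map.
    feats = rng.standard_normal((4, 5, 16))
    sim = localization_map(feats[2, 3], feats)
    print("similarity map shape:", sim.shape)

    # Separation: masks applied to the mixture recover each source.
    spec_a, spec_b = rng.random((8, 10)), rng.random((8, 10))
    mixture, mask_a, mask_b = mix_and_separate_targets(spec_a, spec_b)
    print("max reconstruction error:", np.abs(mask_a * mixture - spec_a).max())

    # Image mixing: the blended frame contains both visual sources.
    mixed = mix_images(rng.random((32, 32, 3)), rng.random((32, 32, 3)))
    print("mixed image shape:", mixed.shape)
```

In a full training loop, a separation decoder would presumably predict the masks from the mixture conditioned on visual features, and the mixed image would be aligned with the audio representations of both sources; those learned components are omitted here.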