We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object with its visual features and the corresponding components of the audio waveform, all without access to annotations during training. To overcome this challenge, we adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions and encourage a stronger alignment between the audio, visual, and natural language modalities. During inference, our approach can separate sounds given text, video, and audio input, or given text and audio input alone. We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets, MUSIC, SOLOS, and AudioSet, where we outperform state-of-the-art strongly supervised approaches despite not using object detectors or text labels during training.