Self-supervised sound source localization is usually challenged by modality inconsistency. In recent studies, contrastive-learning-based strategies have shown promise in establishing a consistent correspondence between audio and the sound sources in visual scenes. However, insufficient attention to the heterogeneity of features across modalities still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned consistently with the visual modality. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Extensive experiments on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance compared with other state-of-the-art works in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN
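For readers unfamiliar with the contrastive baseline that the abstract builds on, the following is a minimal NumPy sketch of a generic audio-visual contrastive objective. It is not the Induction Network itself: the Induction Vector, the gradient decoupling, the visual weighting, and the adaptive threshold are not reproduced here. The function names, tensor shapes, max-pooling over locations, and the temperature value are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis, eps=1e-8):
    # Scale vectors along `axis` to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def similarity_maps(visual, audio):
    # visual: (B, C, H, W) feature maps, audio: (B, C) embeddings (assumed shapes).
    v = l2_normalize(visual, axis=1)
    a = l2_normalize(audio, axis=1)
    # maps[i, j, h, w] = cosine similarity between audio clip i
    # and spatial location (h, w) of image j.
    return np.einsum("ic,jchw->ijhw", a, v)

def contrastive_loss(maps, temperature=0.07):
    # Pool each map to one score (max over locations, an assumed choice),
    # then apply a softmax cross-entropy with matching pairs on the diagonal.
    b = maps.shape[0]
    logits = maps.reshape(b, b, -1).max(axis=2) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy data standing in for encoder outputs (random, for illustration only).
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 16, 7, 7))
audio = rng.normal(size=(4, 16))

maps = similarity_maps(visual, audio)
loss = contrastive_loss(maps)
heatmap = maps[0, 0]  # localization map for the first matching audio-image pair
```

In this sketch the diagonal slice `maps[i, i]` plays the role of the localization heatmap for pair `i`. In an autograd framework, decoupling the modality gradients would typically involve a stop-gradient on one branch, which plain NumPy cannot express.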