Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

Self-supervised sound source localization is often hindered by inconsistency between the audio and visual modalities. Recent contrastive-learning strategies have shown promise in establishing a consistent correspondence between audio and the sound sources in visual scenes. However, insufficient attention to the heterogeneity of the different modality features still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned consistently with the visual modality. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Extensive experiments on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance over other state-of-the-art methods in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN
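The abstract does not spell out the Induction Network itself, so the following is only a minimal sketch of the generic contrastive localization idea it builds on: an audio embedding is compared with every spatial visual feature to form a localization map, and an InfoNCE-style loss rewards the matching audio-image pair. All names here (`cosine`, `localization_map`, `info_nce`), the temperature value, and the toy data are illustrative assumptions, not the authors' code; the Induction Vector, gradient decoupling, visual weighting, and adaptive threshold are not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def localization_map(audio_vec, visual_grid):
    """Similarity of the audio embedding with each spatial visual feature."""
    return [[cosine(audio_vec, feat) for feat in row] for row in visual_grid]

def info_nce(scores, positive_index, temperature=0.07):
    """Generic InfoNCE loss: negative log-softmax of the positive pair's score.
    The temperature is an assumed value, not taken from the paper."""
    logits = [s / temperature for s in scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[positive_index]

# Toy 2x2 visual feature grid with 2-dimensional features (hypothetical data).
audio = [1.0, 0.0]
grid = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.5, 0.5], [-1.0, 0.0]]]

sim = localization_map(audio, grid)

# The predicted source location is the cell with the highest similarity.
peak = max(((r, c) for r in range(2) for c in range(2)),
           key=lambda rc: sim[rc[0]][rc[1]])
```

In this toy case the peak falls on the cell whose feature points in the same direction as the audio embedding. In a training loop, one score per image in the batch (for example the maximum of each map) would be passed to `info_nce`, with the paired image as the positive.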

