The objective of the sound source localization task is to enable machines to detect the location of sound-making objects within a visual scene. Although the audio modality provides spatial cues for locating the sound source, existing approaches use audio only in an auxiliary role, to compare spatial regions of the visual modality. Humans, on the other hand, utilize both audio and visual modalities as spatial cues to locate sound sources. In this paper, we propose an audio-visual spatial integration network that integrates spatial cues from both modalities to mimic human behavior when detecting sound-making objects. Additionally, we introduce a recursive attention network to mimic the human behavior of iteratively focusing on objects, resulting in more accurate attention regions. To effectively encode spatial information from both modalities, we propose an audio-visual pair matching loss and a spatial region alignment loss. By utilizing the spatial cues of both modalities and recursively focusing on objects, our method can perform more robust sound source localization. Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source datasets demonstrate the superiority of our proposed method over existing approaches. Our code is available at: https://github.com/VisualAIKHU/SIRA-SSL