Self-supervised audio-visual source localization aims to locate sound-source objects in video frames without extra annotations. Recent methods often approach this goal with contrastive learning, which assumes that only the audio and visual contents from the same video are positive samples for each other. However, this assumption suffers from false negative samples in real-world training. For example, for a given audio sample, treating frames from the same audio class as negative samples may mislead the model and therefore harm the learned representations (e.g., the audio of a siren wailing may reasonably correspond to the ambulances in multiple images). Based on this observation, we propose a new learning strategy named False Negative Aware Contrastive (FNAC) to reduce the misleading effect of such false negative samples on training. Specifically, we use intra-modal similarities to identify potentially similar samples and construct corresponding adjacency matrices to guide contrastive learning. Further, we propose to strengthen the role of true negative samples by explicitly leveraging the visual features of sound sources to facilitate the differentiation of authentic sounding source regions. FNAC achieves state-of-the-art performance on Flickr-SoundNet, VGG-Sound, and AVSBench, which demonstrates the effectiveness of our method in mitigating the false negative issue. The code is available at \url{https://github.com/weixuansun/FNAC-AVL}.
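To make the adjacency-matrix idea concrete, the following is a minimal NumPy sketch and not the released FNAC code. It assumes each clip has already been reduced to one pooled audio vector and one pooled visual vector, that the adjacency matrix is a row-wise softmax over intra-modal cosine similarities, and that guidance is applied as an L1 penalty between the cross-modal prediction and that adjacency. The function names, the temperature value, and the exact loss form are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def row_softmax(x):
    """Numerically stable softmax over each row."""
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def intra_modal_adjacency(feats, temperature=0.1):
    """Soft adjacency within one modality (assumed form, not the paper's)."""
    f = l2_normalize(feats)
    return row_softmax(f @ f.T / temperature)


def cross_modal_prediction(audio, visual, temperature=0.1):
    """Row i: how strongly audio i is matched to each visual sample."""
    a, v = l2_normalize(audio), l2_normalize(visual)
    return row_softmax(a @ v.T / temperature)


def adjacency_l1_loss(audio, visual, targets, temperature=0.1):
    """Mean absolute gap between the cross-modal prediction and the targets."""
    pred = cross_modal_prediction(audio, visual, temperature)
    return float(np.abs(pred - targets).mean())


# Toy batch: clips 0 and 1 share the same content (a false negative pair),
# clip 2 is unrelated.
audio = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
visual = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Standard contrastive assumption: only the same clip is a positive.
identity_targets = np.eye(3)
# False-negative-aware targets built from audio-audio similarity.
audio_adjacency = intra_modal_adjacency(audio)

print("penalty with identity targets: ",
      adjacency_l1_loss(audio, visual, identity_targets))
print("penalty with adjacency targets:",
      adjacency_l1_loss(audio, visual, audio_adjacency))
```

In this toy batch the identity targets penalize the model for matching audio 0 to the identical frame 1, while the adjacency targets do not, which is the effect the abstract describes. The same construction could be applied with visual-visual similarities.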