The task of Visual Sound Source Localization (VSSL) involves identifying the location of sound sources in visual scenes, integrating audio-visual data for enhanced scene understanding. Despite advancements in state-of-the-art (SOTA) models, we observe three critical flaws in how they are evaluated: i) evaluation focuses mainly on sounds produced by objects that are visible in the image; ii) evaluation often assumes prior knowledge of the size of the sounding object; and iii) no universal threshold for localization in real-world scenarios has been established, because previous approaches consider only positive examples rather than both positive and negative cases. In this paper, we introduce a novel test set and metrics designed to complete the current standard evaluation of VSSL models by testing them in scenarios where none of the objects in the image corresponds to the audio input, i.e., a negative audio. We consider three types of negative audio: silence, noise, and offscreen sound. Our analysis reveals that numerous SOTA models fail to appropriately adjust their predictions based on the audio input, suggesting that these models may not be leveraging audio information as intended. Additionally, we provide a comprehensive analysis of the range of maximum values in the estimated audio-visual similarity maps, in both positive and negative audio cases, and show that most of the models are not discriminative enough, which prevents choosing a universal threshold suitable for sound localization without a priori information about the sounding object, namely its size and visibility.
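The idea of a universal threshold over the maximum values of similarity maps can be illustrated with a minimal sketch. It is not the paper's test set or metrics: the function names, the decision rule (a sound is declared present when the map maximum exceeds the threshold), and the use of balanced accuracy as the selection criterion are all assumptions made here for illustration.

```python
from typing import Sequence, Tuple


def map_max(similarity_map: Sequence[Sequence[float]]) -> float:
    """Return the maximum value of a 2-D audio-visual similarity map."""
    return max(max(row) for row in similarity_map)


def balanced_accuracy(pos: Sequence[float], neg: Sequence[float], thr: float) -> float:
    """Score a threshold on map maxima from positive and negative audio.

    Assumed rule (hypothetical): a sounding object is declared present
    when the map maximum is strictly greater than `thr`.
    """
    true_positive_rate = sum(v > thr for v in pos) / len(pos)
    true_negative_rate = sum(v <= thr for v in neg) / len(neg)
    return 0.5 * (true_positive_rate + true_negative_rate)


def best_universal_threshold(
    pos: Sequence[float], neg: Sequence[float]
) -> Tuple[float, float]:
    """Pick the single threshold that best separates the two sets of maxima.

    Candidates are the observed maxima; ties keep the lowest candidate.
    """
    candidates = sorted(set(pos) | set(neg))
    best_thr, best_score = candidates[0], -1.0
    for thr in candidates:
        score = balanced_accuracy(pos, neg, thr)
        if score > best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score


if __name__ == "__main__":
    # Toy maxima, invented for illustration only.
    positive_maxima = [0.9, 0.8, 0.7]   # audio matches a visible object
    negative_maxima = [0.2, 0.3, 0.75]  # silence, noise, or offscreen sound
    print(best_universal_threshold(positive_maxima, negative_maxima))
```

Under these assumptions, a model whose maxima overlap heavily between positive and negative audio scores close to 0.5 for every threshold, which is one way to read "not discriminative enough"; a model that reacts to the audio would approach 1.0.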