A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio

Visual Sound Source Localization (VSSL) is the task of identifying the location of sound sources in visual scenes, integrating audio-visual data for enhanced scene understanding. Despite advances in state-of-the-art (SOTA) models, we observe three critical flaws: i) evaluation focuses mainly on sounds produced by objects that are visible in the image; ii) evaluation often assumes prior knowledge of the size of the sounding object; and iii) no universal threshold for localization in real-world scenarios has been established, because previous approaches consider only positive examples rather than both positive and negative cases. In this paper, we introduce a novel test set and metrics designed to complete the current standard evaluation of VSSL models by testing them in scenarios where none of the objects in the image corresponds to the audio input, i.e., negative audio. We consider three types of negative audio: silence, noise, and offscreen. Our analysis reveals that numerous SOTA models fail to adjust their predictions appropriately to the audio input, suggesting that these models may not be leveraging audio information as intended. Additionally, we provide a comprehensive analysis of the range of maximum values in the estimated audio-visual similarity maps for both positive and negative audio, and show that most models are not discriminative enough to support a universal threshold for sound localization without a priori information about the sounding object, namely its size and visibility.
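To make the thresholding question concrete, the following is a minimal sketch of how a single threshold on the maximum of a similarity map might be chosen from positive and negative audio cases. It is not the authors' implementation: the cosine-similarity formulation, the balanced-accuracy criterion, and all function names (`similarity_map`, `max_scores`, `best_universal_threshold`) are assumptions introduced for illustration.

```python
import numpy as np


def similarity_map(audio_emb, visual_feats):
    """Cosine similarity between one audio embedding (D,) and a grid of
    visual features (H, W, D). Assumed formulation, not taken from the paper."""
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    return v @ a  # shape (H, W)


def max_scores(maps):
    """Maximum value of each similarity map, one score per (image, audio) pair."""
    return np.array([m.max() for m in maps])


def best_universal_threshold(pos_scores, neg_scores):
    """Pick the single threshold that maximizes balanced accuracy, treating
    score >= t as 'a visible object is sounding' (hypothetical criterion)."""
    candidates = np.unique(np.concatenate([pos_scores, neg_scores]))
    best_t, best_acc = None, -1.0
    for t in candidates:
        tpr = np.mean(pos_scores >= t)  # positives correctly accepted
        tnr = np.mean(neg_scores < t)   # negatives correctly rejected
        acc = 0.5 * (tpr + tnr)
        if acc > best_acc:
            best_t, best_acc = float(t), float(acc)
    return best_t, best_acc
```

Under these assumptions, a model whose maximum scores for positive and negative audio overlap heavily would reach a balanced accuracy near 0.5 at every threshold, which is one way to read the claim that such a model is not discriminative enough for a universal threshold.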

