Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects can often be recognized easily from visual cues alone, a phenomenon we refer to as visual bias. Such biases prevent these benchmarks from effectively evaluating AVSL models. To validate our hypothesis regarding visual bias, we examine two representative AVSL benchmarks, VGG-SS and EpicSounding-Object, on which vision-only models outperform all audio-visual baselines. Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.