Audio-Visual Source Localization (AVSL) aims to locate sounding objects within video frames given paired audio clips. Existing methods predominantly rely on self-supervised contrastive learning of audio-visual correspondence. Without bounding-box annotations, they struggle to achieve precise localization, especially for small objects, and suffer from blurry boundaries and false positives. Moreover, naive semi-supervised methods fail to fully leverage the information in abundant unlabeled data. In this paper, we propose a novel semi-supervised learning framework for AVSL, named Dual Mean-Teacher (DMT), which comprises two teacher-student structures to circumvent the confirmation-bias issue. Specifically, two teachers, pre-trained on limited labeled data, filter out noisy samples via the consensus between their predictions and then generate high-quality pseudo-labels by intersecting their confidence maps. The full utilization of both labeled and unlabeled data, together with the proposed unbiased framework, enables DMT to outperform current state-of-the-art methods by a large margin, achieving CIoU of 90.4% and 48.8% on Flickr-SoundNet and VGG-Sound Source, i.e., improvements of 8.9% and 9.6% over self-supervised methods and 4.6% and 6.4% over semi-supervised methods, respectively, given only 3% positional annotations. We also extend our framework to several existing AVSL methods and consistently boost their performance.
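The noise-filtering and pseudo-labeling step described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the threshold `tau`, the agreement criterion (IoU between the teachers' binarized maps), and the cutoff `agree_iou` are all hypothetical choices standing in for details given in the paper body.

```python
import numpy as np

def generate_pseudo_label(map_a, map_b, tau=0.5, agree_iou=0.6):
    """Sketch of DMT-style pseudo-labeling from two teachers' confidence maps.

    map_a, map_b: 2-D confidence maps in [0, 1] predicted by the two teachers.
    Returns a pseudo-label map, or None if the sample is rejected as noisy.
    """
    # Binarize each teacher's confidence map at a (hypothetical) threshold.
    bin_a = map_a >= tau
    bin_b = map_b >= tau

    # Consensus check: reject the sample if the teachers' predictions
    # overlap too little (IoU of the binarized maps below agree_iou).
    inter = np.logical_and(bin_a, bin_b)
    union = np.logical_or(bin_a, bin_b)
    iou = inter.sum() / max(union.sum(), 1)
    if iou < agree_iou:
        return None  # noisy sample: filtered out, not pseudo-labeled

    # Pseudo-label from the intersection of the confidence maps:
    # keep only regions both teachers agree on, at the lower confidence.
    return np.minimum(map_a, map_b) * inter
```

Under this sketch, a sample only contributes a pseudo-label when both teachers localize roughly the same region, which is the mechanism the abstract credits with avoiding confirmation bias.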