Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been made, our experiments reveal that current methods achieve only marginal performance gains from the unlabeled frames, leaving them underutilized. To fully exploit the potential of unlabeled frames for AVS, we explicitly divide them into two categories based on their temporal characteristics, i.e., neighboring frames (NFs) and distant frames (DFs). NFs, which are temporally adjacent to the labeled frame, often contain rich motion information that assists in accurately localizing the sounding objects. In contrast, DFs lie at long temporal distances from the labeled frame and share semantically similar objects with appearance variations. Considering their distinct characteristics, we propose a versatile framework that effectively leverages both to tackle AVS. Specifically, for NFs, we exploit motion cues as dynamic guidance to improve objectness localization. For DFs, we exploit their semantic cues by treating them as valid augmentations of the labeled frames, which are then used to enrich data diversity in a self-training manner. Extensive experimental results demonstrate the versatility and superiority of our method, unleashing the power of the abundant unlabeled frames.
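The NF/DF split described above can be illustrated with a minimal sketch. The abstract does not specify the exact partitioning rule, so the function below, including its name and the `nf_radius` threshold, is a hypothetical illustration: unlabeled frame indices within a fixed temporal distance of the labeled frame are treated as neighboring frames (NFs), and all others as distant frames (DFs).

```python
def split_unlabeled_frames(num_frames, labeled_idx, nf_radius=2):
    """Partition unlabeled frame indices into neighboring (NF) and distant (DF)
    sets by their temporal distance from the single labeled frame.

    This is an illustrative sketch; the paper's actual criterion may differ.
    """
    nfs, dfs = [], []
    for t in range(num_frames):
        if t == labeled_idx:
            continue  # the labeled frame itself belongs to neither set
        if abs(t - labeled_idx) <= nf_radius:
            nfs.append(t)  # temporally adjacent: rich motion cues
        else:
            dfs.append(t)  # temporally distant: semantic augmentations
    return nfs, dfs

# Example: a 10-frame clip whose 5th frame (index 4) is labeled.
nfs, dfs = split_unlabeled_frames(10, 4, nf_radius=2)
print(nfs)  # [2, 3, 5, 6]
print(dfs)  # [0, 1, 7, 8, 9]
```

Under this sketch, NFs would feed the motion-guided localization branch, while DFs would serve as appearance-varied augmentations for self-training.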