Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS), as an extension of AVS, further pursues semantic understanding of audio-visual scenes. However, because the AVSS task requires establishing audio-visual correspondence and semantic understanding simultaneously, we observe that previous methods struggle to handle these entangled objectives in end-to-end training, resulting in insufficient learning and sub-optimal results. We therefore propose a two-stage training strategy, \textit{Stepping Stones}, which decomposes the AVSS task into two simpler subtasks, from localization to semantic understanding, each fully optimized in its own stage so as to achieve step-by-step global optimization. This training strategy also proves generalizable and effective when applied to existing methods. To further improve performance on AVS tasks, we propose a novel framework, Adaptive Audio Visual Segmentation, which incorporates an adaptive audio query generator and integrates masked attention into the transformer decoder, facilitating the adaptive fusion of visual and audio features. Extensive experiments demonstrate that our methods achieve state-of-the-art results on all three AVS benchmarks. The project homepage can be accessed at https://gewu-lab.github.io/stepping_stones/.
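To make the fusion mechanism named above concrete, the following PyTorch sketch shows one plausible reading of an adaptive audio query generator and of masked attention in a transformer decoder, in the Mask2Former style. It is a minimal illustration, not the paper's implementation: the module names, tensor shapes, and hyperparameters (\texttt{num\_queries}, \texttt{dim}, \texttt{heads}) are all assumptions.

\begin{verbatim}
import torch
import torch.nn as nn


class AdaptiveAudioQueryGenerator(nn.Module):
    """Sketch: learnable queries cross-attend to audio features, so each
    query adaptively carries cues about the sounding objects in the clip."""

    def __init__(self, num_queries: int = 100, dim: int = 256, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T_audio, dim) from an audio backbone (assumed).
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, audio_feats, audio_feats)
        return out  # (B, num_queries, dim) audio-conditioned object queries


class MaskedAttentionDecoderLayer(nn.Module):
    """Sketch: masked cross-attention restricts each query to the
    foreground region predicted by the previous decoder layer."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, vis_feats, bg_mask):
        # queries: (B, Nq, dim); vis_feats: (B, H*W, dim)
        # bg_mask: (B, Nq, H*W) bool, True = background pixel to block.
        # Unblock rows that are fully masked, else attention yields NaNs.
        bg_mask = bg_mask & ~bg_mask.all(dim=-1, keepdim=True)
        attn_mask = bg_mask.repeat_interleave(self.heads, dim=0)
        out, _ = self.cross_attn(queries, vis_feats, vis_feats,
                                 attn_mask=attn_mask)
        return self.norm(queries + out + self.ffn(out))
\end{verbatim}

Under the two-stage \textit{Stepping Stones} strategy described above, a model built from such components would first be optimized for class-agnostic sound-source localization, and only in the second stage for semantic labeling of the localized regions.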