Audio-visual video segmentation~(AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and to ensure that these maps faithfully adhere to the given audio, for example by identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features from their respective temporal and spatial dimensions, capturing their combined dependence. To reduce memory consumption, we design a block that, when stacked, captures fine-grained audio-visual combinatorial dependence in a memory-efficient manner. Additionally, we introduce audio-constrained queries during the decoding phase. These queries contain rich object-level information, ensuring that the decoded mask adheres to the sounds. Experimental results confirm the effectiveness of our approach: our framework achieves new state-of-the-art performance on all three datasets with two backbones. The code is available at \url{https://github.com/aspirinone/CATR.github.io}.