Audio-visual segmentation (AVS) aims to segment sounding objects in videos at the pixel level. Existing approaches typically fuse information from the audio and visual modalities, modeling the contribution of each modality implicitly or explicitly. However, the correlations between the modalities themselves tend to be overlooked in audio-visual modeling. In this paper, inspired by the human ability to mentally simulate both the sound of an object and its visual appearance, we introduce a bidirectional generation framework. The framework establishes robust correlations between an object's visual characteristics and its associated sound, thereby improving AVS performance. To achieve this, we employ a visual-to-audio projection component that reconstructs audio features from object segmentation masks and is trained to minimize the reconstruction error. Moreover, because many sounds are linked to object movement, we introduce an implicit volumetric motion estimation module to handle temporal dynamics that can be difficult to capture with conventional optical flow methods. We conduct comprehensive experiments and analyses on the widely used AVSBench benchmark and establish a new state of the art, with particularly strong results on the challenging MS3 subset, which involves segmenting multiple sound sources. To facilitate reproducibility, we plan to release both the source code and the pre-trained model.
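The visual-to-audio reconstruction idea can be illustrated with a minimal sketch. The abstract does not specify the architecture, so everything here is an assumption made for illustration: the predicted mask pools the visual features (masked average pooling), a single linear map projects the pooled vector into the audio feature space, and a mean squared error serves as the reconstruction error. The function names `masked_average_pool`, `project`, and `reconstruction_loss` are hypothetical and are not taken from the authors' code.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]   # H x W x C visual features
Mask = List[List[float]]          # H x W soft segmentation mask in [0, 1]


def masked_average_pool(features: FeatureMap, mask: Mask) -> Vector:
    """Average the visual features over the pixels selected by the mask."""
    channels = len(features[0][0])
    pooled = [0.0] * channels
    total_weight = 0.0
    for feature_row, mask_row in zip(features, mask):
        for feature, weight in zip(feature_row, mask_row):
            total_weight += weight
            for c in range(channels):
                pooled[c] += weight * feature[c]
    if total_weight == 0.0:
        return pooled  # empty mask: return the all-zero vector
    return [value / total_weight for value in pooled]


def project(vector: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Assumed linear projection from visual space (C) to audio space (D)."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def reconstruction_loss(predicted: Vector, target: Vector) -> float:
    """Mean squared error between reconstructed and true audio features."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Toy example: a 2 x 2 feature map with 2 channels and a 3-dim audio feature.
features = [[[1.0, 2.0], [9.0, 9.0]],
            [[9.0, 9.0], [3.0, 4.0]]]
mask = [[1.0, 0.0],
        [0.0, 1.0]]                      # the object covers two pixels
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
audio_target = [2.0, 3.0, 4.0]

pooled = masked_average_pool(features, mask)
audio_reconstructed = project(pooled, weights, bias)
loss = reconstruction_loss(audio_reconstructed, audio_target)
```

In a trainable version, this loss would be minimized jointly with the segmentation objective, so that a mask covering the wrong object yields audio features that reconstruct poorly. The actual model presumably uses learned deep features rather than hand-set weights.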