We propose an Explicit Conditional Multimodal Variational Auto-Encoder (ECMVAE) for audio-visual segmentation (AVS), the task of segmenting sound sources in a video sequence. Existing AVS methods rely on implicit feature fusion strategies, in which models are trained to fit the discrete samples in the dataset. When the dataset is limited in size and diversity, the resulting performance is usually unsatisfactory. In contrast, we approach the problem from a representation learning perspective and model the contribution of each modality explicitly. Specifically, we find that audio carries critical category information about the sound producers, while visual data provides the candidate sound producer(s). The information shared between the two modalities corresponds to the target sound producer(s) shown in the visual data, which makes cross-modal shared representation learning especially important for AVS. To this end, ECMVAE factorizes the representation of each modality into a modality-shared representation and a modality-specific representation. An orthogonality constraint between the shared and specific representations keeps the factorized latent codes mutually exclusive. In addition, a mutual information maximization regularizer encourages extensive exploration of each modality. Quantitative and qualitative evaluations on AVSBench demonstrate the effectiveness of our approach, which sets a new state of the art for AVS, with a 3.84 mIoU improvement on the challenging MS3 subset for multiple sound source segmentation.
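
The abstract does not give the exact form of the orthogonality constraint, so the following is only a minimal sketch of one common choice: the squared Frobenius norm of the cross-product between the shared and specific codes of one modality. The function name `orthogonality_penalty`, the array shapes, and the use of NumPy are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def orthogonality_penalty(shared, specific):
    """Hypothetical orthogonality penalty between two latent codes.

    shared:   array of shape (batch, d_shared), modality-shared codes
    specific: array of shape (batch, d_specific), modality-specific codes

    Returns the squared Frobenius norm of shared^T @ specific. The value is
    zero when every shared dimension is orthogonal to every specific
    dimension across the batch, and grows as the two codes overlap.
    """
    shared = np.asarray(shared, dtype=float)
    specific = np.asarray(specific, dtype=float)
    # Cross-product matrix of shape (d_shared, d_specific).
    cross = shared.T @ specific
    return float(np.sum(cross ** 2))


# Toy codes for a batch of two samples (values chosen only for illustration).
orthogonal_pair = orthogonality_penalty([[1.0], [0.0]], [[0.0], [1.0]])
overlapping_pair = orthogonality_penalty([[1.0], [0.0]], [[1.0], [0.0]])
```

In a training setup, a term of this kind would typically be added to the overall loss with a weighting coefficient, once per modality; the weighting and the exact combination used by ECMVAE are not specified in the abstract.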