Recently, significant progress has been made in multi-modal continual learning, which aims to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks and fall short of addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, which aims to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where sounding objects are labeled as background in sequential tasks; 2) co-occurrence confusion, where frequently co-occurring classes tend to be confused with one another. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurrence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, which increases the rehearsal frequency of confusable classes during training. Moreover, we construct three audio-visual incremental scenarios to verify the effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.
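The abstract does not specify how MSS and CSR are implemented; the following minimal Python sketch only illustrates the two rehearsal ideas at a high level, under assumptions of our own. The consistency scores, collision counts, and all function names here are hypothetical, not the paper's actual method.

```python
import random

def select_rehearsal_samples(samples, consistency, k):
    """MSS-style selection (sketch): keep the k samples with the highest
    audio-visual consistency score, i.e. those whose audio and visual
    cues agree most strongly, for the rehearsal memory.

    samples: list of sample ids
    consistency: dict mapping sample id -> assumed score in [0, 1]
    """
    return sorted(samples, key=lambda s: consistency[s], reverse=True)[:k]

def rehearsal_weights(memory_classes, collision_counts):
    """CSR-style weighting (sketch): give each stored class a replay
    probability proportional to how often it 'collides' (co-occurs and
    gets confused) with current-task classes, with add-one smoothing so
    every class keeps a nonzero chance of being rehearsed.
    """
    total = sum(collision_counts.get(c, 0) + 1 for c in memory_classes)
    return {c: (collision_counts.get(c, 0) + 1) / total for c in memory_classes}

def sample_rehearsal_class(memory_classes, collision_counts, rng=random):
    """Draw one class to rehearse, biased toward confusable classes."""
    w = rehearsal_weights(memory_classes, collision_counts)
    return rng.choices(memory_classes,
                       weights=[w[c] for c in memory_classes])[0]
```

Under this sketch, a class that frequently co-occurs with (and is confused for) a current-task class receives a larger weight and is therefore replayed more often, which is the intuition the CSR mechanism in the abstract describes.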