In this paper, we introduce a new task, continual audio-visual sound separation, in which a model must keep learning to separate sound sources of new classes, with the aid of visual guidance, while preserving its performance on previously learned classes. This problem matters for practical visually guided auditory perception: solving it makes audio-visual sound separation models more adaptable and robust, and therefore more applicable to real-world scenarios where new sound sources are commonly encountered. The task is inherently challenging because a model must not only exploit information from both modalities in the current task but also preserve the cross-modal associations learned in old tasks, so as to mitigate catastrophic forgetting during audio-visual continual learning. To address these challenges, we propose ContAV-Sep (\textbf{Cont}inual \textbf{A}udio-\textbf{V}isual Sound \textbf{Sep}aration). ContAV-Sep introduces a Cross-modal Similarity Distillation Constraint (CrossSDC) that maintains cross-modal semantic similarity across incremental tasks and retains the semantic-similarity knowledge acquired by old models, thereby mitigating the risk of catastrophic forgetting. CrossSDC integrates seamlessly into the training process of different audio-visual sound separation frameworks. Experiments demonstrate that ContAV-Sep effectively mitigates catastrophic forgetting and achieves significantly better performance than other continual learning baselines for audio-visual sound separation. Code is available at: \url{https://github.com/weiguoPian/ContAV-Sep_NeurIPS2024}.