Continual Audio-Visual Sound Separation

In this paper, we introduce a novel continual audio-visual sound separation task, aiming to continuously separate sound sources for new classes while preserving performance on previously learned classes, with the aid of visual guidance. This problem is crucial for practical visually guided auditory perception as it can significantly enhance the adaptability and robustness of audio-visual sound separation models, making them more applicable for real-world scenarios where encountering new sound sources is commonplace. The task is inherently challenging as our models must not only effectively utilize information from both modalities in current tasks but also preserve their cross-modal association in old tasks to mitigate catastrophic forgetting during audio-visual continual learning. To address these challenges, we propose a novel approach named ContAV-Sep (Continual Audio-Visual Sound Separation). ContAV-Sep presents a novel Cross-modal Similarity Distillation Constraint (CrossSDC) to uphold the cross-modal semantic similarity through incremental tasks and retain previously acquired knowledge of semantic similarity in old models, mitigating the risk of catastrophic forgetting. CrossSDC can seamlessly integrate into the training process of different audio-visual sound separation frameworks. Experiments demonstrate that ContAV-Sep can effectively mitigate catastrophic forgetting and achieve significantly better performance compared to other continual learning baselines for audio-visual sound separation. Code is available at: https://github.com/weiguoPian/ContAV-Sep_NeurIPS2024.
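
The abstract does not give the CrossSDC formulation, so the following is only a generic illustration of the idea of distilling cross-modal similarity from an old model into a new one. It is a minimal sketch under our own assumptions, not the authors' loss: the function names (`cosine`, `similarity_matrix`, `similarity_distillation_loss`), the use of cosine similarity, and the mean-squared penalty are all hypothetical choices made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # small epsilon avoids division by zero

def similarity_matrix(audio_feats, visual_feats):
    """Pairwise audio-to-visual cosine similarities for a batch of samples."""
    return [[cosine(a, v) for v in visual_feats] for a in audio_feats]

def similarity_distillation_loss(old_audio, old_visual, new_audio, new_visual):
    """Mean squared gap between the old model's and the new model's
    cross-modal similarity matrices (an assumed, illustrative penalty)."""
    old_sim = similarity_matrix(old_audio, old_visual)
    new_sim = similarity_matrix(new_audio, new_visual)
    gaps = [
        (o - n) ** 2
        for old_row, new_row in zip(old_sim, new_sim)
        for o, n in zip(old_row, new_row)
    ]
    return sum(gaps) / len(gaps)
```

In such a setup, the penalty would typically be added to the separation loss of the current task, so that the new model is discouraged from drifting away from the audio-visual associations the old model had learned. The actual constraint used by ContAV-Sep is defined in the paper and the linked repository.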

