Audio-visual sound separation typically assumes that every sound source is visible in the video, which excludes invisible sounds that originate outside the camera's view. Current methods struggle with such sounds because they lack visual cues. This paper introduces a novel framework, "Audio-Visual Scene-Aware Separation" (AVSA-Sep), which consists of a semantic parser for visible and invisible sounds and a separator that performs scene-informed separation. AVSA-Sep successfully separates both types of sound, and joint training with cross-modal alignment further improves its effectiveness.