While existing Audio-Visual Speech Separation (AVSS) methods focus primarily on audio-visual fusion strategies for two-speaker separation, they suffer a severe performance drop in multi-speaker scenarios. Typically, AVSS methods use guiding videos to isolate speakers one at a time from a given audio mixture, leaving notable missing and noisy segments in the separated speech. In this study, we propose a simultaneous multi-speaker separation framework that separates multiple speakers concurrently within a single pass. We introduce speaker-wise interactions to establish distinctions and correlations among speakers. Experimental results on the VoxCeleb2 and LRS3 datasets demonstrate that our method achieves state-of-the-art performance in separating mixtures of 2, 3, 4, and 5 speakers. Moreover, our model can exploit speakers with complete audio-visual information to compensate for speakers whose visual cues are deficient, improving its robustness to missing visual input. We also conduct experiments in which the visual information of specific speakers is entirely absent or visual frames are partially missing. The results show that our model consistently outperforms other methods, exhibiting the smallest performance drop across all settings with 2, 3, 4, and 5 speakers.
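To make the notion of "speaker-wise interaction" concrete, the sketch below shows one plausible realization: attention applied across the speaker axis, so each separated speaker stream can attend to every other stream. The module name `SpeakerInteraction`, the tensor shapes, and the layer choices are illustrative assumptions for exposition, not the paper's actual architecture.

```python
# Hypothetical sketch of speaker-wise interaction: multi-head attention over
# the speaker axis lets each speaker's features attend to all other speakers'.
# Shapes and layer choices are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class SpeakerInteraction(nn.Module):
    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_speakers, time, feat_dim)
        b, s, t, d = x.shape
        # Fold time into the batch so attention runs over the speaker axis.
        x = x.permute(0, 2, 1, 3).reshape(b * t, s, d)    # (b*t, speakers, d)
        attended, _ = self.attn(x, x, x)                  # speakers attend to speakers
        x = self.norm(x + attended)                       # residual connection + norm
        return x.reshape(b, t, s, d).permute(0, 2, 1, 3)  # back to (b, s, t, d)

# Example: 2 mixtures, 4 speakers, 50 frames, 256-dim features.
feats = torch.randn(2, 4, 50, 256)
out = SpeakerInteraction(256)(feats)
print(out.shape)  # torch.Size([2, 4, 50, 256])
```

Because all speaker streams pass through this interaction jointly, such a design would naturally allow streams with complete audio-visual information to inform streams whose visual cues are missing, consistent with the robustness behavior reported above.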