Audio-visual speech separation (AVSS) aims to extract a target speech signal from a mixed signal by leveraging both auditory and visual (lip movement) cues. However, most existing AVSS methods have complex architectures and rely on future context, operating offline, which renders them unsuitable for real-time applications. Inspired by the pipeline of RTFSNet, we propose a novel streaming AVSS model, named Swift-Net, which enhances the causal processing capabilities required for real-time applications. Swift-Net adopts a lightweight visual feature extraction module and an efficient fusion module for audio-visual integration. Additionally, Swift-Net employs Grouped SRUs to integrate historical information across different feature spaces, making more efficient use of past context. We further propose a causal transformation template to facilitate the conversion of non-causal AVSS models into causal counterparts. Experiments on three standard benchmark datasets (LRS2, LRS3, and VoxCeleb2) demonstrate that, under causal conditions, the proposed Swift-Net achieves outstanding performance, highlighting the potential of this method for processing speech in complex environments.
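To make the grouped-recurrence idea concrete, the sketch below illustrates one plausible reading of it: the channel dimension is split into groups and each group is processed by its own unidirectional recurrent unit, so history is aggregated per feature subspace while causality is preserved. This is a minimal illustration only, not the authors' implementation; in particular, nn.GRU is used here as a stand-in for the SRU cells named in the abstract, and the class and parameter names are hypothetical.

```python
# Hedged sketch of a grouped causal recurrent block (nn.GRU stands in for SRU).
import torch
import torch.nn as nn


class GroupedRecurrentBlock(nn.Module):
    """Split channels into groups and run a separate unidirectional
    recurrent unit over each group, aggregating history per subspace."""

    def __init__(self, channels: int, num_groups: int = 4):
        super().__init__()
        assert channels % num_groups == 0, "channels must divide evenly into groups"
        self.num_groups = num_groups
        group_dim = channels // num_groups
        # One recurrent unit per group; unidirectional keeps the block causal.
        self.rnns = nn.ModuleList(
            nn.GRU(group_dim, group_dim, batch_first=True) for _ in range(num_groups)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels)
        chunks = x.chunk(self.num_groups, dim=-1)
        outputs = [rnn(chunk)[0] for rnn, chunk in zip(self.rnns, chunks)]
        return torch.cat(outputs, dim=-1)


if __name__ == "__main__":
    block = GroupedRecurrentBlock(channels=64, num_groups=4)
    frames = torch.randn(2, 100, 64)  # (batch, time, channels)
    print(block(frames).shape)        # torch.Size([2, 100, 64])
```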