A Fast and Lightweight Model for Causal Audio-Visual Speech Separation

Audio-visual speech separation (AVSS) aims to extract a target speech signal from a mixture by leveraging both auditory and visual (lip movement) cues. However, most existing AVSS methods have complex architectures and rely on future context, so they operate offline and are unsuitable for real-time applications. Inspired by the pipeline of RTFSNet, we propose a novel streaming AVSS model, named Swift-Net, which strengthens the causal processing capabilities required for real-time applications. Swift-Net adopts a lightweight visual feature extraction module and an efficient fusion module for audio-visual integration. Additionally, Swift-Net employs Grouped SRUs to integrate historical information across different feature spaces, thereby using that historical information more efficiently. We further propose a causal transformation template to facilitate the conversion of non-causal AVSS models into causal counterparts. Experiments on three standard benchmark datasets (LRS2, LRS3, and VoxCeleb2) demonstrate that, under causal conditions, Swift-Net achieves outstanding performance, highlighting the potential of this method for processing speech in complex environments.
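As a rough illustration of what "causal" means in this setting, the following minimal sketch contrasts a symmetrically padded 1D convolution, which sees future samples, with a left-padded one, which does not. It is a generic example and not the causal transformation template proposed in the paper; the function names `conv1d` and `depends_on_future` and the kernel values are assumptions made purely for illustration.

```python
def conv1d(signal, kernel, causal):
    """Same-length 1D convolution (cross-correlation form) with zero padding.

    causal=True pads only on the left, so output[t] depends on signal[:t + 1].
    causal=False pads symmetrically, so output[t] also sees future samples.
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]


def depends_on_future(causal, length=8, kernel=(0.25, 0.5, 0.25)):
    """Return True if changing the last sample alters any earlier output."""
    base = [0.0] * length
    bumped = base[:-1] + [1.0]  # differs from base only at the final time step
    out_a = conv1d(base, kernel, causal)
    out_b = conv1d(bumped, kernel, causal)
    # Compare every output except the last one.
    return any(a != b for a, b in zip(out_a[:-1], out_b[:-1]))
```

With left-only padding, each output can be computed as soon as its input sample arrives, which is the property a streaming model needs; the symmetric variant must wait for future samples before producing an output.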

