Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet extends the CrossNet architecture, a recently proposed network that performs complex spectral mapping for speech separation by leveraging global attention and positional encoding. To utilize visual cues effectively, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before being fed to the AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, and the COG-MHEAR challenge dataset. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance on all audiovisual tasks, even on untrained and mismatched datasets.
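The following is a minimal sketch, not the authors' implementation, of the audiovisual front end described above: pre-extracted visual embeddings pass through a temporal-convolution visual encoder, and the result is fused with audio features in an early fusion layer before the CrossNet-style blocks. All dimensions, the concatenation-based fusion, and the frame-rate upsampling are illustrative assumptions.

```python
# Hypothetical sketch of the visual encoder and early fusion layer.
# Dimensions and fusion strategy are assumptions, not the paper's spec.
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """Temporal convolutions over pre-extracted visual embeddings."""

    def __init__(self, vis_dim=512, hidden_dim=256, num_layers=3):
        super().__init__()
        layers, in_ch = [], vis_dim
        for _ in range(num_layers):
            layers += [nn.Conv1d(in_ch, hidden_dim, kernel_size=3, padding=1),
                       nn.ReLU()]
            in_ch = hidden_dim
        self.net = nn.Sequential(*layers)

    def forward(self, v):            # v: (batch, frames, vis_dim)
        v = v.transpose(1, 2)        # -> (batch, vis_dim, frames) for Conv1d
        return self.net(v).transpose(1, 2)  # -> (batch, frames, hidden_dim)


class EarlyFusion(nn.Module):
    """Concatenate time-aligned audio and visual features, then project."""

    def __init__(self, audio_dim=256, visual_dim=256, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(audio_dim + visual_dim, out_dim)

    def forward(self, a, v):         # a: (batch, frames, audio_dim)
        # Upsample visual features to the audio frame rate if they differ.
        if v.shape[1] != a.shape[1]:
            v = nn.functional.interpolate(
                v.transpose(1, 2), size=a.shape[1], mode="linear",
                align_corners=False).transpose(1, 2)
        return self.proj(torch.cat([a, v], dim=-1))


if __name__ == "__main__":
    a = torch.randn(2, 200, 256)     # audio features: 200 frames
    v = torch.randn(2, 50, 512)      # visual embeddings: e.g., 25 fps video
    fused = EarlyFusion()(a, VisualEncoder()(v))
    print(fused.shape)               # torch.Size([2, 200, 256])
```

The fused representation would then be processed by the AV-CrossNet blocks; the concatenation-plus-projection fusion shown here is one common early-fusion choice, used only to make the data flow concrete.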