Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains highly challenging. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach that incorporates visual information into all system components. The efficacy of the video input is consistently demonstrated in the mask-based MVDR speech separation front-end, the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end, and the Conformer ASR back-end. Audio-visual integrated front-end architectures that perform speech separation and dereverberation either in a pipelined fashion or jointly via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and the ASR back-end is minimized by end-to-end joint fine-tuning using either the ASR cost function alone or its interpolation with the speech enhancement loss. Experiments were conducted on overlapped and reverberant mixture speech data constructed by simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline, with absolute word error rate (WER) reductions of 9.1% and 6.2% (41.7% and 36.0% relative). Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
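The absolute and relative WER reductions quoted above are linked by the usual definition, relative reduction = absolute reduction / baseline WER. The short Python sketch below uses that relation to back out the baseline and system WERs implied by the reported pairs. The function names are hypothetical and chosen only for illustration, and because the reported figures are rounded, the derived values are approximate and are not numbers stated in the abstract.

```python
def implied_baseline_wer(absolute_reduction: float, relative_reduction_pct: float) -> float:
    """Baseline WER (%) implied by: relative = absolute / baseline."""
    if relative_reduction_pct <= 0:
        raise ValueError("relative reduction must be positive")
    return absolute_reduction / (relative_reduction_pct / 100.0)


def implied_system_wer(absolute_reduction: float, relative_reduction_pct: float) -> float:
    """System WER (%) implied by: system = baseline - absolute reduction."""
    baseline = implied_baseline_wer(absolute_reduction, relative_reduction_pct)
    return baseline - absolute_reduction


# (absolute %, relative %) pairs as reported in the abstract.
reported = [(9.1, 41.7), (6.2, 36.0)]

for absolute, relative in reported:
    baseline = implied_baseline_wer(absolute, relative)
    system = implied_system_wer(absolute, relative)
    # Values are approximate because the inputs are rounded.
    print(f"abs {absolute:.1f}, rel {relative:.1f}% -> "
          f"baseline ~{baseline:.1f}%, system ~{system:.1f}%")
```

Under these assumptions, the two pairs correspond to audio-only baseline WERs of roughly 21.8% and 17.2%.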