Speech applications in far-field, real-world settings often deal with signals corrupted by reverberation. Dereverberation is therefore an important step for improving audible quality and reducing error rates in applications such as automatic speech recognition (ASR). We propose a unified speech dereverberation framework that improves both speech quality and ASR performance using an envelope-carrier decomposition provided by an autoregressive (AR) model. The AR model is applied in the frequency domain of the sub-band speech signals to separate the envelope and carrier parts. We propose a novel neural architecture based on the dual-path long short-term memory (DPLSTM) model, which jointly enhances the sub-band envelope and carrier components. The dereverberated envelope and carrier signals are modulated, and the sub-band signals are synthesized to reconstruct the audio signal. The DPLSTM dereverberation model also allows its network weights to be learned jointly with the downstream ASR task. In ASR experiments on the REVERB challenge dataset and the VOiCES dataset, we show that joint learning of the speech dereverberation network and the end-to-end (E2E) ASR model yields significant performance improvements over a baseline ASR system trained on log-mel spectrograms, as well as over other dereverberation benchmarks (average relative improvements of 10-24% over the baseline system). Speech quality improvements, evaluated using subjective listening tests, further highlight the improved quality of the reconstructed audio.
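As a rough illustration of the envelope-carrier idea described above, the following minimal sketch fits an AR model to the discrete cosine transform (DCT) of a signal, reads the AR model's response as a smooth temporal power envelope, and treats the residual as the carrier. It is not the authors' implementation. It assumes a single full-band frame rather than sub-band signals, an arbitrary model order of 8, and a synthetic decaying tone as input, and it omits the DPLSTM enhancement stage entirely. All function names (`dct2`, `levinson_durbin`, `fdlp_envelope`, `decompose`, `modulate`) are hypothetical, and only the Python standard library is used.

```python
import cmath
import math


def dct2(x):
    """Unnormalized DCT-II: y[k] = sum_n x[n] * cos(pi * (n + 0.5) * k / N)."""
    n_samples = len(x)
    return [
        sum(x[n] * math.cos(math.pi * (n + 0.5) * k / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def levinson_durbin(r, order):
    """Solve the AR normal equations; returns (a, err) with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def fdlp_envelope(x, order=8):
    """Approximate the temporal power envelope of x via an AR fit to its DCT.

    The model order is an illustrative assumption, not a value from the paper.
    """
    n_samples = len(x)
    y = dct2(x)
    # Autocorrelation of the DCT sequence for lags 0..order.
    r = [sum(y[k] * y[k + m] for k in range(n_samples - m))
         for m in range(order + 1)]
    a, err = levinson_durbin(r, order)
    env = []
    for n in range(n_samples):
        # Time index n maps to the angle at which the AR response is evaluated.
        theta = math.pi * (n + 0.5) / n_samples
        denom = sum(a[m] * cmath.exp(-1j * theta * m) for m in range(order + 1))
        env.append(err / abs(denom) ** 2)
    return env


def decompose(x, order=8):
    """Split x into a power envelope and a carrier (x divided by sqrt(envelope))."""
    env = fdlp_envelope(x, order)
    carrier = [xi / math.sqrt(ei) for xi, ei in zip(x, env)]
    return env, carrier


def modulate(env, carrier):
    """Recombine envelope and carrier into a time-domain signal."""
    return [ci * math.sqrt(ei) for ci, ei in zip(carrier, env)]


# Synthetic stand-in for a reverberant tail: an exponentially decaying tone.
signal = [math.exp(-n / 40.0) * math.sin(2 * math.pi * 0.1 * n) for n in range(256)]
envelope, carrier = decompose(signal)
rebuilt = modulate(envelope, carrier)
```

In this sketch the decomposition is exactly invertible, so `rebuilt` matches `signal` up to floating-point error. In the framework described above, a network would enhance the envelope and carrier before they are modulated back together.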