Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising

This paper describes speech enhancement for real-time automatic speech recognition (ASR) in real environments. A standard approach to this task is neural beamforming, which can work efficiently in an online manner. It estimates the masks of clean dry speech from a noisy, echoic mixture spectrogram with a deep neural network (DNN) and then computes an enhancement filter used for beamforming. The performance of such a supervised approach, however, degrades drastically under mismatched conditions. This calls for run-time adaptation of the DNN. Although the ground-truth speech spectrogram required for adaptation is not available at run time, blind dereverberation and separation methods such as weighted prediction error (WPE) and fast multichannel nonnegative matrix factorization (FastMNMF) can be used to generate pseudo ground-truth data from a mixture. Based on this idea, a prior work proposed a dual-process system based on a cascade of WPE and minimum variance distortionless response (MVDR) beamforming asynchronously fine-tuned by block-online FastMNMF. To integrate the dereverberation capability into neural beamforming and make it fine-tunable at run time, we propose to use weighted power minimization distortionless response (WPD) beamforming, a unified version of WPE and minimum power distortionless response (MPDR) beamforming, whose joint dereverberation and denoising filter is estimated using a DNN. We evaluated the impact of run-time adaptation under various conditions with different numbers of speakers, reverberation times, and signal-to-noise ratios (SNRs).
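To make the mask-then-filter step concrete, the following is a minimal NumPy sketch of mask-based beamforming. It is an illustration under stated assumptions, not the system described here: it uses the plain MVDR beamformer in its reference-microphone form rather than WPD, it does no dereverberation, and the speech mask is simply passed in where a DNN estimate would normally be used. The function name `mask_based_mvdr` and its parameters are hypothetical.

```python
import numpy as np


def mask_based_mvdr(mix, speech_mask, ref_mic=0, eps=1e-6):
    """Mask-based MVDR beamforming (illustrative sketch, not WPD).

    mix:         complex mixture STFT, shape (F, T, M)
    speech_mask: time-frequency speech mask in [0, 1], shape (F, T);
                 assumed to come from a DNN in a real system
    Returns the enhanced single-channel STFT, shape (F, T).
    """
    noise_mask = 1.0 - speech_mask
    n_mics = mix.shape[-1]

    # Mask-weighted spatial covariance matrices, shape (F, M, M).
    scm_s = np.einsum("ft,ftm,ftn->fmn", speech_mask, mix, mix.conj())
    scm_s /= speech_mask.sum(axis=1)[:, None, None] + eps
    scm_n = np.einsum("ft,ftm,ftn->fmn", noise_mask, mix, mix.conj())
    scm_n /= noise_mask.sum(axis=1)[:, None, None] + eps

    # Diagonal loading keeps the noise covariance invertible.
    scm_n += eps * np.eye(n_mics)

    # Reference-microphone MVDR: w = (Rn^-1 Rs) u / trace(Rn^-1 Rs),
    # where u selects the reference microphone.
    numer = np.linalg.solve(scm_n, scm_s)               # (F, M, M)
    trace = np.trace(numer, axis1=1, axis2=2)[:, None]  # (F, 1)
    w = numer[:, :, ref_mic] / (trace + eps)            # (F, M)

    # Apply the filter per frequency bin: y = w^H x.
    return np.einsum("fm,ftm->ft", w.conj(), mix)
```

In a run-time adaptation setting, the mask estimator would be the component that is fine-tuned, while the filter computation above stays fixed.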
