In this work, we propose a frequency bin-wise method that estimates the single-channel speech presence probability (SPP) with multiple deep neural networks (DNNs) in the short-time Fourier transform (STFT) domain. Conventional DNN-based SPP estimators typically take all frequency bins simultaneously as input features, which makes high model complexity inevitable. To reduce the model complexity and the requirements on the training data, we train a separate gated recurrent unit (GRU) for each frequency bin, using that bin and a few of its neighboring bins as input. The models are trained on noisy speech together with an a posteriori SPP representation. Experiments were performed on the Deep Noise Suppression challenge dataset. The results show that the frequency bin-wise model improves speech detection accuracy. Finally, we demonstrate that the proposed method outperforms most state-of-the-art SPP estimation methods in terms of speech detection accuracy and model complexity.
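The bin-wise input described above can be illustrated with a minimal, dependency-free sketch. It only shows how one frequency bin and its neighbors could be grouped into a per-bin input sequence; the function name `binwise_features`, the parameter `num_neighbors`, and the clamping of out-of-range neighbors at the spectrum edges are assumptions made for illustration and are not taken from the paper.

```python
def binwise_features(frames, num_neighbors=1):
    """Group each frequency bin with its neighboring bins.

    frames: list of spectral frames, each a list of per-bin values
            (for example, the log-magnitudes of one STFT frame).
    Returns features[k][t]: the 2 * num_neighbors + 1 values centred on
    bin k at frame t, i.e. the input sequence for the k-th bin's model.
    """
    num_bins = len(frames[0])
    features = []
    for k in range(num_bins):
        sequence = []
        for frame in frames:
            # Assumption: neighbors that fall outside the spectrum are
            # clamped to the nearest valid bin (edge replication).
            window = [
                frame[min(max(j, 0), num_bins - 1)]
                for j in range(k - num_neighbors, k + num_neighbors + 1)
            ]
            sequence.append(window)
        features.append(sequence)
    return features


# Toy spectrogram: 2 frames, 4 frequency bins (hypothetical values).
spec = [[0.0, 1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0, 7.0]]
feats = binwise_features(spec, num_neighbors=1)
```

Each `feats[k]` would then be fed to its own small recurrent model, so the input width per model is `2 * num_neighbors + 1` rather than the full number of frequency bins, which is where the reduction in model complexity would come from.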