Deep neural networks (DNNs) have greatly benefited direction-of-arrival (DoA) estimation methods for speech source localization in noisy environments. However, their localization accuracy is still far from satisfactory because they remain vulnerable to nonspeech interference. To improve robustness against interference, we propose a DNN-based normalized time-frequency (T-F) weighted criterion that minimizes the distance between the candidate steering vectors and the filtered snapshots in the T-F domain. Our method requires no eigendecomposition and uses a simple normalization to prevent the optimization objective from being misled by noisy filtered snapshots. We also study different designs of T-F weights guided by a DNN. In the proposed approach, we find that duplicating the Hadamard product of speech ratio masks is highly effective and outperforms other techniques such as direct masking and taking the mean. In general, however, the best-performing design of T-F weights is criterion-dependent. Experiments show that the proposed method outperforms popular DNN-based DoA estimation methods, including widely used subspace methods, in noisy and reverberant environments.
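The criterion described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact filtering, normalization, or weight construction, so everything below is an assumption made for illustration. In particular, the sketch assumes a far-field linear array, normalizes each snapshot to unit norm with the phase of microphone 0 removed, takes the T-F weight as the Hadamard (elementwise) product of per-channel speech ratio masks, and performs a grid search over candidate angles. The function names `steering_vectors`, `normalize_snapshots`, and `weighted_doa` are hypothetical.

```python
import numpy as np


def steering_vectors(angles_rad, freqs_hz, mic_x, c=343.0):
    """Far-field steering vectors for a linear array (assumed geometry).

    Returns an array of shape (D, F, M) with unit norm per (angle, frequency)
    and zero phase at microphone 0.
    """
    # Propagation delay of each microphone for a plane wave from each angle.
    delays = np.cos(angles_rad)[:, None] * mic_x[None, :] / c        # (D, M)
    phase = -2j * np.pi * freqs_hz[None, :, None] * delays[:, None, :]
    a = np.exp(phase)                                                # (D, F, M)
    a = a * np.conj(a[..., :1])      # make microphone 0 the phase reference
    return a / np.sqrt(mic_x.size)   # unit Euclidean norm


def normalize_snapshots(Y, eps=1e-12):
    """Normalize STFT snapshots Y of shape (T, F, M).

    Assumed normalization: remove the phase of microphone 0, then scale each
    snapshot to unit norm, so that source magnitude and phase do not affect
    the distance to a steering vector.
    """
    ref_phase = Y[..., :1] / (np.abs(Y[..., :1]) + eps)
    Y = Y * np.conj(ref_phase)
    return Y / (np.linalg.norm(Y, axis=-1, keepdims=True) + eps)


def weighted_doa(Y, masks, A, angles_rad):
    """Grid search for the angle minimizing the weighted T-F distance.

    Y:      (T, F, M) complex snapshots
    masks:  (T, F, M) per-channel speech ratio masks in [0, 1]
    A:      (D, F, M) candidate steering vectors
    """
    # Assumed weight design: Hadamard product of the masks across channels.
    w = np.prod(masks, axis=-1)                                      # (T, F)
    Yn = normalize_snapshots(Y)
    # Squared distance between every candidate and every snapshot: (D, T, F).
    diff = A[:, None, :, :] - Yn[None, :, :, :]
    dist = np.sum(np.abs(diff) ** 2, axis=-1)
    # Weighted sum over all T-F bins gives one cost per candidate angle.
    cost = np.tensordot(dist, w, axes=([1, 2], [0, 1]))              # (D,)
    return angles_rad[np.argmin(cost)], cost
```

In this sketch no covariance matrix is formed and no eigendecomposition is computed; the estimate comes from a weighted sum of distances evaluated on the candidate grid. In practice the masks would come from a trained DNN, whereas here they are simply an input array.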