In a multi-channel separation task with multiple speakers, we aim to recover all individual speech signals from the mixture. In contrast to single-channel approaches, which rely on the different spectro-temporal characteristics of the speech signals, multi-channel approaches should additionally exploit the different spatial locations of the sources to achieve a more powerful separation, especially as the number of sources increases. To enhance the spatial processing in multi-channel source separation, we propose a deep neural network (DNN) based spatially selective filter (SSF) that can be spatially steered to extract the speaker of interest by initializing a recurrent neural network layer with the target direction. We compare the proposed SSF with a common end-to-end direct separation (DS) approach trained using utterance-wise permutation invariant training (PIT), which learns to perform spatial filtering only implicitly. We show that the SSF has a clear advantage over a DS approach with the same underlying network architecture when there are more than two speakers in the mixture, which can be attributed to a better use of the spatial information. Furthermore, we find that the SSF generalizes much better to additional noise sources that were not seen during training.
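To illustrate the utterance-wise PIT criterion used to train the DS baseline, the following is a minimal sketch in plain Python and not the training code of this work. The function names `mse` and `upit_loss`, the mean-squared-error pair loss, and the toy signals are assumptions made for the example; the abstract does not state which loss is used. The sketch evaluates every assignment of network outputs to reference speakers over the whole utterance and keeps the assignment with the lowest average loss.

```python
from itertools import permutations


def mse(estimate, target):
    """Mean squared error between two equal-length signals (assumed pair loss)."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def upit_loss(estimates, targets, pair_loss=mse):
    """Utterance-wise PIT: pick one output-to-speaker assignment per utterance.

    estimates, targets: lists with one signal (list of floats) per speaker.
    Returns (lowest average loss, best permutation), where best_perm[i] is
    the index of the target assigned to estimates[i].
    """
    n = len(targets)
    best_loss, best_perm = float("inf"), None
    # Try all n! assignments; each is scored over the full utterance.
    for perm in permutations(range(n)):
        loss = sum(pair_loss(estimates[i], targets[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example with three "speakers" (hypothetical constant signals).
targets = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [-1.0, -1.0, -1.0]]
# The outputs arrive in a different order than the references.
estimates = [[-0.9, -1.0, -1.1], [1.0, 1.0, 1.0], [0.0, 0.1, 0.0]]

loss, perm = upit_loss(estimates, targets)
print(perm)  # assignment of each output to a reference speaker
print(loss)
```

In this formulation the output order carries no meaning, and the number of candidate assignments grows factorially with the number of speakers. A steered filter such as the SSF avoids this ambiguity because the target direction specifies which speaker each output should contain.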