In a multi-channel separation task with multiple speakers, we aim to recover all individual speech signals from the mixture. In contrast to single-channel approaches, which rely on the different spectro-temporal characteristics of the speech signals, multi-channel approaches should additionally exploit the different spatial locations of the sources to achieve more powerful separation, especially as the number of sources increases. To enhance spatial processing in multi-channel source separation, we propose a deep neural network (DNN) based spatially selective filter (SSF) that can be steered to extract the speaker of interest by initializing a recurrent neural network layer with the target direction. We compare the proposed SSF with a common end-to-end direct separation (DS) approach trained using utterance-wise permutation invariant training (PIT), which learns to perform spatial filtering only implicitly. We show that the SSF has a clear advantage over a DS approach with the same underlying network architecture when there are more than two speakers in the mixture, which can be attributed to better use of the spatial information. Furthermore, we find that the SSF generalizes much better to additional noise sources that were not seen during training and to scenarios in which speakers are positioned at similar angles.
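For readers unfamiliar with the utterance-wise PIT objective used to train the DS baseline, the following is a minimal sketch of the idea: the loss is evaluated for every assignment of network outputs to reference speakers over the whole utterance, and the assignment with the lowest loss is kept. The function names and the use of a mean squared error are assumptions made for illustration; the abstract does not state which signal-level loss was actually used.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def utterance_pit_mse(estimates, references):
    """Utterance-wise PIT with an MSE criterion (illustrative assumption).

    estimates, references: lists of signals, one per speaker.
    Returns (best_loss, best_perm), where best_perm[i] is the index of the
    reference assigned to estimate i for the whole utterance.
    """
    n = len(references)
    # pairwise[i][j]: loss between estimate i and reference j over the utterance
    pairwise = [[mse(est, ref) for ref in references] for est in estimates]
    best_loss, best_perm = float("inf"), None
    # Exhaustive search over all n! output-to-speaker assignments
    for perm in permutations(range(n)):
        loss = sum(pairwise[i][perm[i]] for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Because a single permutation is chosen per utterance, each output channel stays with one speaker over time, but the search grows factorially with the number of speakers. A steerable filter such as the SSF needs no permutation search, since the target direction already determines which speaker is extracted.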