When multiple people talk simultaneously, the spatial characteristics of the signals are the most distinctive feature for extracting the target signal. In this work, we develop a deep joint spatial-spectral non-linear filter that can be steered toward an arbitrary target direction. To this end, we propose a simple and effective conditioning mechanism that sets the initial state of the filter's recurrent layers based on the target direction. We show that this scheme is more effective than the baseline approach and increases the flexibility of the filter at no performance cost. As we demonstrate in this paper, the resulting spatially selective non-linear filters can also be used to separate the speech of an arbitrary number of speakers and enable very accurate multi-speaker localization.
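The conditioning idea, setting a recurrent layer's initial state from the target direction, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-hot encoding over a uniform angular grid, the single linear projection `w_cond`, the plain tanh recurrence, the random untrained weights, and all names (`one_hot_direction`, `SteerableRNN`).

```python
import numpy as np


def one_hot_direction(angle_deg, num_directions):
    """Encode a target angle as a one-hot vector on a uniform grid (assumed encoding)."""
    resolution = 360.0 / num_directions
    index = int(round((angle_deg % 360.0) / resolution)) % num_directions
    encoding = np.zeros(num_directions)
    encoding[index] = 1.0
    return encoding


class SteerableRNN:
    """Toy tanh recurrent layer whose initial state depends on the target direction."""

    def __init__(self, input_size, hidden_size, num_directions, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1  # small random weights; a real filter would learn these
        self.w_in = rng.normal(0.0, scale, (hidden_size, input_size))
        self.w_rec = rng.normal(0.0, scale, (hidden_size, hidden_size))
        self.bias = np.zeros(hidden_size)
        # Hypothetical conditioning projection: direction encoding -> initial state.
        self.w_cond = rng.normal(0.0, scale, (hidden_size, num_directions))
        self.num_directions = num_directions

    def initial_state(self, angle_deg):
        """Map the target direction to the layer's initial hidden state."""
        encoding = one_hot_direction(angle_deg, self.num_directions)
        return np.tanh(self.w_cond @ encoding)

    def forward(self, inputs, angle_deg):
        """Run the recurrence over time frames, starting from the steered state."""
        state = self.initial_state(angle_deg)
        outputs = []
        for frame in inputs:
            state = np.tanh(self.w_in @ frame + self.w_rec @ state + self.bias)
            outputs.append(state)
        return np.stack(outputs)


# Same input features, two different steering directions.
layer = SteerableRNN(input_size=6, hidden_size=8, num_directions=180)
features = np.random.default_rng(1).normal(size=(10, 6))
out_a = layer.forward(features, angle_deg=30.0)
out_b = layer.forward(features, angle_deg=120.0)
```

In this sketch the input features and layer weights are identical for both calls; only the initial state differs, so any difference between `out_a` and `out_b` comes from the steering direction alone.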