Multi-channel speech separation that exploits a speaker's directional information has demonstrated significant gains over blind speech separation. However, it has two limitations. First, performance degrades substantially when the arrival directions of two sound sources are close. Second, the result relies heavily on precise estimation of the speaker's direction. To overcome these issues, this paper proposes 3D features and an associated 3D neural beamformer for multi-channel speech separation. Previous work in this area is extended in two important ways. First, the traditional 1D directional beam patterns are generalized to 3D. This enables the model to extract speech from any target region in 3D space, so speakers with similar directions but different elevations or distances become separable. Second, to handle uncertainty in the speaker location, the previously proposed spatial feature is extended to a new 3D region feature. The proposed 3D region feature and 3D neural beamformer are evaluated in an in-car scenario. Experimental results demonstrate that the combination of the 3D feature and the 3D beamformer achieves performance comparable to that of a separation model given the ground-truth speaker location as input.
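The first limitation can be illustrated with a small numerical sketch. It is not the paper's 3D feature or beamformer; it only shows, under assumptions chosen here for illustration (a hypothetical 4-microphone linear array with 10 cm spacing, a 4 kHz tone, free-field propagation, and the helper names below), why a direction-only (plane-wave) model cannot distinguish two sources on the same bearing, while a position-dependent (spherical-wave) model can.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def far_field_steering(mics, azimuth, freq):
    """Plane-wave steering vector: depends only on the arrival direction."""
    u = np.array([np.cos(azimuth), np.sin(azimuth), 0.0])  # unit vector toward the source
    delays = -(mics @ u) / C  # relative arrival delay at each microphone
    return np.exp(-2j * np.pi * freq * delays)

def near_field_steering(mics, point, freq):
    """Spherical-wave steering vector: depends on the full 3D source position."""
    dists = np.linalg.norm(mics - point, axis=1)
    return np.exp(-2j * np.pi * freq * dists / C) / dists

def similarity(a, b):
    """Normalized correlation magnitude; 1.0 means the two vectors are indistinguishable."""
    return np.abs(np.vdot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical linear array centered at the origin, 10 cm spacing along x.
mics = np.array([[x, 0.0, 0.0] for x in (-0.15, -0.05, 0.05, 0.15)])
freq = 4000.0  # Hz, illustrative choice

# Two sources on the same bearing (broadside) but at different distances.
src_near = np.array([0.0, 0.3, 0.0])
src_far = np.array([0.0, 1.5, 0.0])
az_near = np.arctan2(src_near[1], src_near[0])
az_far = np.arctan2(src_far[1], src_far[0])

# Direction-only model: both sources map to the same steering vector.
sim_far_field = similarity(
    far_field_steering(mics, az_near, freq),
    far_field_steering(mics, az_far, freq),
)

# Position-dependent model: the steering vectors differ.
sim_near_field = similarity(
    near_field_steering(mics, src_near, freq),
    near_field_steering(mics, src_far, freq),
)

print(f"direction-only similarity:     {sim_far_field:.3f}")
print(f"position-dependent similarity: {sim_near_field:.3f}")
```

In this setup the direction-only similarity is 1, so no beam pattern defined over direction alone can tell the two sources apart, whereas the position-dependent similarity falls clearly below 1. A linear array is used only to keep the sketch short; it cannot resolve elevation, so the example covers the distance case only.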