This paper introduces a multi-microphone method for extracting a desired speaker from a mixture of multiple speakers and directional noise in a reverberant environment. We propose leveraging the instantaneous relative transfer function (RTF), estimated from a reference utterance recorded at the same position as the desired source. The effectiveness of the RTF-based spatial cue is compared with that of a direction-of-arrival (DOA)-based spatial cue and of the conventional spectral embedding. Experimental results in challenging acoustic scenarios demonstrate that spatial cues yield better performance than the spectral cue, and that the instantaneous RTF outperforms the DOA-based spatial cue.
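As a rough illustration of what an instantaneous RTF cue might look like, the sketch below computes, for every time-frequency bin of a multi-microphone reference utterance, the ratio of each microphone's STFT to that of a reference microphone. The abstract does not specify the estimator, so the regularized per-bin ratio, the helper names `stft` and `instantaneous_rtf`, the choice of microphone 0 as reference, and the frame parameters are all assumptions made here for illustration, not the method of the paper.

```python
import numpy as np


def stft(x, frame_len=512, hop=256):
    """Hann-windowed STFT (hypothetical helper).

    x: array of shape (num_mics, num_samples).
    Returns a complex array of shape (num_mics, num_frames, num_bins).
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (x.shape[-1] - frame_len) // hop
    # Slice overlapping frames and stack them along a new frame axis.
    frames = np.stack(
        [x[..., i * hop: i * hop + frame_len] for i in range(num_frames)],
        axis=-2,
    )
    return np.fft.rfft(frames * window, axis=-1)


def instantaneous_rtf(x, ref_mic=0, eps=1e-12, frame_len=512, hop=256):
    """Per-bin ratio of each microphone's STFT to the reference microphone.

    Assumed estimator: X_m * conj(X_ref) / (|X_ref|^2 + eps), i.e. a
    regularized version of X_m / X_ref that avoids division by zero.
    """
    X = stft(x, frame_len=frame_len, hop=hop)
    X_ref = X[ref_mic]  # shape (num_frames, num_bins), broadcast over mics
    return X * np.conj(X_ref) / (np.abs(X_ref) ** 2 + eps)


if __name__ == "__main__":
    # Synthetic two-microphone "reference utterance": white noise at mic 0,
    # and an attenuated copy at mic 1 (no reverberation, illustration only).
    rng = np.random.default_rng(0)
    source = rng.standard_normal(16000)
    reference_utterance = np.stack([source, 0.5 * source])
    rtf = instantaneous_rtf(reference_utterance)
    print(rtf.shape)
```

In this toy setup the second microphone is a scaled copy of the first, so its estimated ratio is close to the scale factor in bins where the reference microphone has energy. In a reverberant room the ratio would instead be complex-valued and frequency dependent, which is what makes it informative about the source position.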