We propose a Beamformer-guided Target Speaker Extraction (BG-TSE) method that extracts a target speaker's voice from a multi-channel recording, using the target's direction of arrival as side information. The proposed method employs a front-end beamformer steered towards the target speaker to provide an auxiliary signal to a single-channel TSE system. By allowing for time-varying embeddings in the single-channel TSE block, the proposed method exploits the correspondence between the front-end beamformer output and the target speech in the microphone signal. Experimental evaluation on simulated multi-channel two-speaker mixtures, in both anechoic and reverberant conditions, demonstrates the advantage of the proposed method over recent single-channel and multi-channel baselines.
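To make the front-end stage concrete, the following is a minimal sketch of a beamformer steered towards a known direction of arrival. It is not taken from the paper: the abstract does not say which beamformer is used, so a frequency-domain delay-and-sum beamformer is assumed here, together with a linear array, a far-field plane-wave model, and an angle measured from the array axis. The names `delay_and_sum`, `mic_positions`, `doa_deg`, and `SPEED_OF_SOUND` are illustrative. The output would play the role of the auxiliary signal passed to the single-channel TSE system; the TSE network itself is not sketched.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def delay_and_sum(mics, mic_positions, doa_deg, fs):
    """Steer a linear array towards doa_deg and average the aligned channels.

    mics:          array of shape (n_mics, n_samples), time-domain signals
    mic_positions: array of shape (n_mics,), positions along the array axis in metres
    doa_deg:       direction of arrival in degrees, measured from the array axis
    fs:            sampling rate in Hz
    """
    mics = np.asarray(mics, dtype=float)
    mic_positions = np.asarray(mic_positions, dtype=float)
    n_samples = mics.shape[1]

    # Far-field model: a plane wave from doa_deg reaches a microphone at
    # position x with delay -x * cos(theta) / c relative to the origin.
    theta = np.deg2rad(doa_deg)
    delays = -mic_positions * np.cos(theta) / SPEED_OF_SOUND

    # Undo each channel's delay with a linear phase shift per frequency bin.
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectra = np.fft.rfft(mics, axis=1)
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = spectra * steering

    # Averaging the aligned channels reinforces the steered direction and
    # attenuates signals arriving from other directions.
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)
```

Because the phase shifts are applied over the whole signal, the delays are circular; a practical system would more likely work frame by frame in the STFT domain, and in reverberant conditions a simple delay-and-sum design only partially isolates the target.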