In this work, we address the problem of binaural target-speaker extraction in the presence of multiple simultaneous talkers. We propose a novel approach that leverages the individual listener's Head-Related Transfer Function (HRTF) to isolate the target speaker. The proposed method is speaker-independent, as it does not rely on speaker embeddings. We employ a fully complex-valued neural network that operates directly on the complex-valued Short-Time Fourier Transform (STFT) of the mixed audio signals, and compare it to a Real-Imaginary (RI)-based neural network, demonstrating the advantages of the former. We first evaluate the method in an anechoic, noise-free scenario, achieving excellent extraction performance while preserving the binaural cues of the target signal. We then extend the evaluation to reverberant conditions. Our method proves robust, maintaining speech clarity and source directionality while simultaneously reducing reverberation. A comparative analysis with existing binaural Target Speaker Extraction (TSE) methods shows that the proposed approach achieves performance comparable to state-of-the-art techniques in terms of noise reduction and perceptual quality, while providing a clear advantage in preserving binaural cues. Demo page: https://bi-ctse-hrtf.github.io
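The distinction between the two input representations mentioned above can be illustrated with a minimal sketch (not the authors' implementation; the mixture here is hypothetical random audio, and `scipy.signal.stft` stands in for whatever STFT front end is actually used). A fully complex-valued network consumes the complex STFT coefficients directly, whereas an RI-based network receives the real and imaginary parts stacked as separate real-valued channels:

```python
import numpy as np
from scipy.signal import stft

# Hypothetical binaural mixture: 2 channels (left/right ear), 1 s at 16 kHz.
fs = 16000
rng = np.random.default_rng(0)
mixture = rng.standard_normal((2, fs))

# Complex-valued STFT of each ear signal.
# X has shape (channels, freq_bins, frames) and a complex dtype:
# this is the representation a fully complex-valued network operates on.
_, _, X = stft(mixture, fs=fs, nperseg=512)

# Real-Imaginary (RI) alternative: stack real and imaginary parts as
# separate real channels, doubling the channel dimension (4 real-valued
# channels instead of 2 complex-valued ones).
X_ri = np.concatenate([X.real, X.imag], axis=0)
```

A complex-valued layer couples the real and imaginary parts through complex multiplication, so phase relationships (and hence interaural cues encoded in the HRTF) are handled jointly rather than as independent real channels.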