In this work, we address the problem of binaural target-speaker extraction in the presence of multiple simultaneous talkers. We propose a novel approach that leverages the individual listener's Head-Related Transfer Function (HRTF) to isolate the target speaker. The proposed method is speaker-independent, as it does not rely on speaker embeddings. We employ a fully complex-valued neural network that operates directly on the complex-valued Short-Time Fourier Transform (STFT) of the mixed audio signals, and compare it to a Real-Imaginary (RI)-based neural network, demonstrating the advantages of the former. We first evaluate the method in an anechoic, noise-free scenario, achieving excellent extraction performance while preserving the binaural cues of the target signal. We then extend the evaluation to reverberant conditions, where our method proves robust, maintaining speech clarity and source directionality while simultaneously reducing reverberation. A comparative analysis with existing binaural Target Speaker Extraction (TSE) methods shows that the proposed approach achieves performance comparable to state-of-the-art techniques in terms of noise reduction and perceptual quality, while offering a clear advantage in preserving binaural cues. Demo page: https://bi-ctse-hrtf.github.io