The remarkable ability of humans to selectively focus on a target speaker in cocktail party scenarios is facilitated by binaural audio processing. In this paper, we present a binaural time-domain target speaker extraction model based on the Filter-and-Sum Network (FaSNet). Inspired by human selective hearing, the proposed model injects the target speaker embedding into the separators through a selective attention block built on multi-head attention. We also compare two binaural interaction approaches: the cosine similarity of time-domain signals and the inter-channel correlation in learned spectral representations. Experimental results show that the proposed model outperforms both monaural configurations and state-of-the-art multi-channel target speaker extraction models, achieving best-in-class performance of 18.52 dB SI-SDR, 19.12 dB SDR, and a PESQ score of 3.05 under anechoic two-speaker test configurations.
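To make two of the quantities named in the abstract concrete, the following is a minimal NumPy sketch of (a) SI-SDR in its standard definition and (b) a frame-wise cosine similarity between the left and right time-domain channels. It is an illustration under stated assumptions, not the paper's implementation: the function names, the frame length, the hop size, and the mean removal in `si_sdr` are hypothetical choices, and the abstract does not specify how the model frames the signals or computes the similarity internally.

```python
import numpy as np


def si_sdr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio (dB) for 1-D signals."""
    # Zero-mean both signals (a common convention; assumed here).
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Optimal scaling of the target: projection of the estimate onto it.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    scaled_target = alpha * target
    residual = estimate - scaled_target
    ratio = (np.sum(scaled_target ** 2) + eps) / (np.sum(residual ** 2) + eps)
    return float(10.0 * np.log10(ratio))


def framewise_cosine_similarity(
    left: np.ndarray,
    right: np.ndarray,
    frame_len: int = 256,  # hypothetical frame length in samples
    hop: int = 128,        # hypothetical hop size in samples
    eps: float = 1e-8,
) -> np.ndarray:
    """Cosine similarity between left and right channels, one value per frame."""
    n_frames = 1 + (len(left) - frame_len) // hop
    # Index matrix of shape (n_frames, frame_len) selecting overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    l_frames, r_frames = left[idx], right[idx]
    dot = np.sum(l_frames * r_frames, axis=1)
    norms = np.linalg.norm(l_frames, axis=1) * np.linalg.norm(r_frames, axis=1)
    return dot / (norms + eps)
```

Because SI-SDR projects the estimate onto the target before measuring the residual, rescaling the estimate leaves the score unchanged, which is why it is preferred over plain SDR when the output gain of a model is arbitrary. The cosine similarity is likewise gain-invariant per frame, so it captures inter-channel phase and time alignment rather than level differences.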