Prevailing noise-resistant and reverberation-resistant localization algorithms for multi-speaker scenarios focus on separating the speakers and outputting a direction for each one, without associating those directions with speaker identity. In this paper, we present a target speaker localization algorithm with a selective hearing mechanism. Given reference speech from the target speaker, we first produce a speaker-dependent spectrogram mask that suppresses the interfering speakers' speech. A long short-term memory (LSTM) network then extracts the target speaker's location from the filtered spectrogram. Experiments show that the proposed method outperforms existing algorithms across a range of scale-invariant signal-to-noise ratio (SNR) conditions. Specifically, at SNR = -10 dB, the proposed network, LocSelect, achieves a mean absolute error (MAE) of 3.55 and an accuracy (ACC) of 87.40%.
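To make the two reported metrics concrete, the following is a minimal sketch of how an MAE and an ACC of this kind could be computed for direction estimates. It is not the authors' evaluation code: it assumes that locations are azimuth angles in degrees with wrap-around at 360, and that a prediction counts as correct when its angular error is within a tolerance. The function names and the 5-degree default tolerance are hypothetical choices for illustration; the abstract does not specify them.

```python
def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees.

    Assumption: angles are azimuths that wrap around at 360 degrees.
    """
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def mae_and_acc(preds, targets, tolerance_deg=5.0):
    """Return (MAE, ACC) over paired predicted and true angles.

    tolerance_deg is a hypothetical threshold, not taken from the paper:
    a prediction is counted as correct if its angular error is at most
    this many degrees.
    """
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equal in length")
    errors = [angular_error(p, t) for p, t in zip(preds, targets)]
    mae = sum(errors) / len(errors)
    acc = sum(1 for e in errors if e <= tolerance_deg) / len(errors)
    return mae, acc


if __name__ == "__main__":
    # Toy values chosen for illustration only; they are not experimental data.
    mae, acc = mae_and_acc([10.0, 350.0, 90.0, 181.0], [12.0, 2.0, 90.0, 170.0])
    print(f"MAE = {mae:.2f}, ACC = {acc:.2%}")
```

The wrap-around handling matters for this kind of metric: a prediction of 350 against a true value of 2 is treated as a 12-degree error, not 348.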