Prevailing noise-robust and reverberation-robust localization algorithms focus on separating the speakers in multi-speaker scenarios and producing a direction estimate for each of them, without associating those estimates with speaker identity. In this paper, we present a target speaker localization algorithm with a selective hearing mechanism. Given a reference utterance from the target speaker, we first produce a speaker-dependent spectrogram mask to suppress the speech of interfering speakers. A long short-term memory (LSTM) network then extracts the target speaker's location from the filtered spectrogram. Experiments show that the proposed method outperforms existing algorithms under different scale-invariant signal-to-noise ratio (SNR) conditions. Specifically, at SNR = -10 dB, our proposed network, LocSelect, achieves a mean absolute error (MAE) of 3.55 and an accuracy (ACC) of 87.40%.
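The abstract reports MAE and ACC without stating how they are computed. As a rough illustration only, the sketch below assumes that locations are azimuth angles in degrees, that errors wrap around at 360 degrees, and that a prediction counts as correct when its error is within a tolerance. The function names and the 5-degree tolerance are hypothetical and are not taken from the paper.

```python
def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two azimuths, in degrees (0 to 180)."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def mae_and_acc(preds_deg, truths_deg, tolerance_deg=5.0):
    """Return (MAE in degrees, ACC in percent) for paired predictions and labels.

    tolerance_deg is an assumed threshold, not a value from the paper.
    """
    errors = [angular_error(p, t) for p, t in zip(preds_deg, truths_deg)]
    mae = sum(errors) / len(errors)
    acc = 100.0 * sum(e <= tolerance_deg for e in errors) / len(errors)
    return mae, acc


# Toy data, invented for illustration; note the wrap-around pair (358, 2).
preds = [10.0, 358.0, 90.0, 200.0]
truths = [12.0, 2.0, 90.0, 180.0]
mae, acc = mae_and_acc(preds, truths)
print(f"MAE = {mae:.2f} deg, ACC = {acc:.2f}%")
```

Handling the wrap-around explicitly matters for this kind of metric: without it, a prediction of 358 degrees against a label of 2 degrees would be scored as a 356-degree error rather than a 4-degree one.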