Target speaker extraction (TSE) relies on a reference cue of the target speaker to extract that speaker's speech from a speech mixture. While a speaker embedding is commonly used as the reference cue, such an embedding, pre-trained on a large number of speakers, may suffer from speaker identity confusion. In this work, we propose a multi-level speaker representation approach, spanning raw features to neural embeddings, to serve as the speaker reference cue. We generate a spectral-level representation from the enrollment magnitude spectrogram as a raw, low-level feature, which significantly improves the model's generalization capability. Additionally, we propose a contextual embedding feature based on a cross-attention mechanism that integrates frame-level embeddings from a pre-trained speaker encoder. By incorporating speaker features across multiple levels, we significantly enhance the performance of the TSE model. Our approach achieves a 2.74 dB improvement and a 4.94% increase in extraction accuracy on the Libri2mix test set over the baseline.
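The abstract does not specify the exact attention formulation; as an illustration only, the contextual embedding idea, where queries come from the mixture and keys/values from the enrollment utterance's frame-level speaker embeddings, can be sketched as single-head cross-attention. All names, projection matrices, and dimensions below are hypothetical toy choices, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contextual_speaker_embedding(mix_feats, enroll_embs, W_q, W_k, W_v):
    """Single-head cross-attention sketch: queries are derived from the
    mixture's frame features, keys/values from the enrollment utterance's
    frame-level speaker embeddings. Returns one context vector per mixture
    frame, so the speaker cue can adapt over time instead of being a
    single utterance-level embedding."""
    Q = mix_feats @ W_q                        # (T_mix, d)
    K = enroll_embs @ W_k                      # (T_enr, d)
    V = enroll_embs @ W_v                      # (T_enr, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T_mix, T_enr)
    attn = softmax(scores, axis=-1)            # rows sum to 1
    return attn @ V                            # (T_mix, d)

# Toy dimensions: 100 mixture frames, 80 enrollment frames,
# 256-dim inputs projected into a 64-dim attention space.
T_mix, T_enr, d_in, d = 100, 80, 256, 64
mix_feats = rng.standard_normal((T_mix, d_in))
enroll_embs = rng.standard_normal((T_enr, d_in))
W_q, W_k, W_v = (rng.standard_normal((d_in, d)) for _ in range(3))
ctx = contextual_speaker_embedding(mix_feats, enroll_embs, W_q, W_k, W_v)
print(ctx.shape)  # (100, 64)
```

The resulting per-frame context vectors would then be combined with the raw spectral-level feature and the utterance-level embedding as the multi-level speaker cue described above.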