Target speaker extraction (TSE) aims to isolate the voice of a specific speaker from a complex acoustic mixture. TSE performance often degrades when the target and interfering speakers have similar voice characteristics. Recent research has introduced curriculum learning (CL), in which TSE models are trained incrementally on speech samples of increasing difficulty. In CL training, the model is first trained on samples with low speaker similarity between the target and interfering speakers, and then on samples with high speaker similarity. To further improve CL, this paper uses a $k$-nearest neighbor-based voice conversion method to generate speech from diverse synthetic interfering speakers, and includes the generated data in the curriculum. Experiments demonstrate that training data based on synthetic speakers effectively enhances the model's capabilities and significantly improves the performance of multiple TSE systems.
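The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and everything in it is an assumption made for illustration: the function names `build_curriculum` and `knn_convert`, cosine similarity as the speaker-similarity measure, the threshold of 0.5, and plain Python lists standing in for speaker embeddings and frame-level features. The first function orders training pairs from low to high target/interferer similarity. The second shows one common form of $k$-nearest neighbor conversion, in which each source frame is replaced by the mean of its $k$ closest frames from a reference speaker.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_curriculum(pairs, threshold=0.5):
    """Split (target_embedding, interferer_embedding) pairs into two stages.

    Stage 0 holds low-similarity (easier) pairs, stage 1 holds
    high-similarity (harder) pairs. The threshold is a hypothetical value.
    """
    scored = [(cosine_similarity(t, i), (t, i)) for t, i in pairs]
    scored.sort(key=lambda item: item[0])  # easiest pairs first
    easy = [pair for score, pair in scored if score < threshold]
    hard = [pair for score, pair in scored if score >= threshold]
    return [easy, hard]


def knn_convert(source_frames, reference_frames, k=2):
    """Replace each source frame with the mean of its k most similar
    reference-speaker frames (an assumed, simplified kNN conversion)."""
    converted = []
    for frame in source_frames:
        nearest = sorted(
            reference_frames,
            key=lambda ref: cosine_similarity(frame, ref),
            reverse=True,
        )[:k]
        converted.append(
            [sum(ref[d] for ref in nearest) / len(nearest)
             for d in range(len(frame))]
        )
    return converted


if __name__ == "__main__":
    # Toy 2-D "embeddings": the first pair is similar, the second is not.
    pairs = [([1.0, 0.0], [0.9, 0.1]), ([1.0, 0.0], [0.0, 1.0])]
    for stage, samples in enumerate(build_curriculum(pairs)):
        print(f"stage {stage}: {len(samples)} pair(s)")
```

In a training loop under these assumptions, the model would be trained on stage 0 before stage 1, and speech synthesized from converted features would be added as additional interfering speakers.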