This paper presents a novel approach to target speaker extraction (TSE) based on curriculum learning (CL), addressing the challenge of distinguishing a target speaker's voice from a mixture that contains interfering speakers. For efficient training, we propose a curriculum that strategically selects training subsets of increasing complexity, for example by increasing the similarity between the target and interfering speakers. Our CL strategies include variants that use predefined difficulty measures (e.g., gender, speaker similarity, and signal-to-distortion ratio) and variants that use the standard TSE objective function, each designed to expose the model gradually to more challenging scenarios. Comprehensive experiments on the Libri2talker dataset show that our CL strategies improve TSE performance, outperforming baseline models trained without CL by about 1 dB.
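The idea of selecting training subsets of increasing complexity can be illustrated with a minimal sketch. It is not the authors' implementation: the `Mixture` class, the `curriculum_stages` function, the cumulative staging scheme, and the assumption that each mixture carries one precomputed scalar difficulty score (for instance, a target/interferer speaker similarity) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Mixture:
    """One training mixture with a precomputed difficulty score (hypothetical)."""
    mixture_id: str
    difficulty: float  # assumed convention: a higher value means a harder example


def curriculum_stages(samples: Sequence[Mixture], num_stages: int) -> List[List[Mixture]]:
    """Return cumulative training subsets ordered from easiest to hardest.

    Stage k (1-based) holds the easiest ceil(k / num_stages) fraction of the
    data, so the final stage contains every sample.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # Sort once by difficulty; sorted() is stable, so ties keep their input order.
    ordered = sorted(samples, key=lambda m: m.difficulty)
    stages = []
    for stage in range(1, num_stages + 1):
        # Integer ceiling of len(ordered) * stage / num_stages.
        cutoff = (len(ordered) * stage + num_stages - 1) // num_stages
        stages.append(ordered[:cutoff])
    return stages
```

In this sketch a training loop would iterate over the stages in order, training on each subset before moving to the next. Whether stages should be cumulative or disjoint, and how many stages to use, are design choices that the abstract does not specify.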