Target speaker extraction (TSE) is essential in speech processing applications, particularly in complex acoustic environments. Current TSE systems suffer from limited data diversity and a lack of robustness in real-world conditions, primarily because they are trained on artificially mixed datasets with limited speaker variability and unrealistic noise profiles. To address these challenges, we propose Libri2Vox, a new dataset that combines clean target speech from the LibriTTS dataset with interference speech from the noisy VoxCeleb2 dataset, providing a large and diverse set of speakers under realistic noisy conditions. We also augment Libri2Vox with synthetic speakers generated using state-of-the-art speech generative models to enhance speaker diversity. To make better use of the synthetic data, we apply curriculum learning, training TSE models on progressively more difficult examples. Extensive experiments across multiple TSE architectures show varying degrees of improvement, with SpeakerBeam showing the largest gain: a 1.39 dB improvement in signal-to-distortion ratio (SDR) on the Libri2Talker test set compared to baseline training. Building on these results, our speaker similarity-based curriculum learning approach with the Conformer architecture achieves an additional 0.78 dB improvement over conventional random sampling, in which training samples are drawn at random from the entire dataset. These results demonstrate the complementary benefits of diverse real-world data, synthetic speaker augmentation, and structured training strategies in building robust TSE systems.
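As an illustration of the speaker similarity-based curriculum idea summarized above, the following is a minimal sketch, not the authors' implementation. It assumes that each training mixture comes with a target and an interferer speaker embedding, that higher cosine similarity between the two indicates a harder mixture, and that the usable pool grows linearly over epochs. The function names (`cosine_similarity`, `curriculum_pool`, `sample_batch`) and the `start_fraction` parameter are hypothetical and do not appear in the source.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def curriculum_pool(pairs, epoch, total_epochs, start_fraction=0.3):
    """Return the subset of mixture pairs available at a given epoch.

    pairs: list of (target_embedding, interferer_embedding) tuples.
    Assumption (not from the source): higher target/interferer
    similarity means a harder mixture.
    """
    # Sort from least similar (assumed easiest) to most similar (assumed hardest).
    ranked = sorted(pairs, key=lambda p: cosine_similarity(p[0], p[1]))
    # Linearly grow the usable fraction from start_fraction to 1.0.
    progress = epoch / max(1, total_epochs - 1)
    fraction = start_fraction + (1.0 - start_fraction) * progress
    size = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:size]


def sample_batch(pairs, epoch, total_epochs, batch_size, rng=random):
    """Draw a batch uniformly from the current curriculum pool."""
    pool = curriculum_pool(pairs, epoch, total_epochs)
    return rng.sample(pool, min(batch_size, len(pool)))
```

In this sketch, setting `start_fraction=1.0` makes the whole dataset available from the first epoch, which corresponds to the random sampling baseline described in the abstract.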