Data augmentation (DA) has played a pivotal role in the success of deep speaker recognition. Current DA techniques are mostly speaker-preserving: they leave the speaker traits of the speech unchanged and create no new speakers. Recent research has pointed to the potential of speaker augmentation, which generates new speakers to enrich the training dataset. In this study, we examine two speaker augmentation approaches: speed perturbation (SP) and vocal tract length perturbation (VTLP). Although both methods are already used empirically, a comprehensive investigation of their efficacy is lacking. Our experiments on two public datasets, VoxCeleb and CN-Celeb, show that both SP and VTLP are effective at generating new speakers and yield significant performance improvements in speaker recognition. Furthermore, the two methods differ in their sensitivity to perturbation factors and to data complexity, which suggests that fusing them may be beneficial. These results underscore the substantial potential of speaker augmentation and the value of further in-depth exploration and analysis.
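To make the idea of speaker augmentation concrete, the following is a minimal sketch of speed perturbation used to create new speaker labels. It is an illustration under stated assumptions, not the implementation used in this study: the abstract does not say how SP is implemented, so the linear-interpolation resampling, the function names `speed_perturb` and `augmented_label`, and the example factors 0.9 and 1.1 are all hypothetical choices. VTLP, which warps the frequency axis of the spectrum instead, is not shown.

```python
import numpy as np


def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample `wave` so that it plays `factor` times faster.

    Assumption: plain linear interpolation stands in for a proper
    band-limited resampler. Both tempo and pitch change, which is why
    the result can be treated as speech from a different speaker.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(round(len(wave) / factor))
    # Position in the original signal that each output sample reads from.
    src_pos = np.arange(n_out) * factor
    return np.interp(src_pos, np.arange(len(wave)), wave)


def augmented_label(speaker_id: int, factor_index: int, num_speakers: int) -> int:
    """Give each (speaker, perturbation) pair its own class label.

    factor_index 0 is reserved for the unperturbed speech, so the
    original labels stay unchanged (a hypothetical labeling scheme).
    """
    return speaker_id + factor_index * num_speakers


# Usage: one second of a 440 Hz tone at 16 kHz, perturbed by two example factors.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440.0 * t)

factors = [1.0, 0.9, 1.1]  # index 0 = original speech
num_speakers = 10          # size of the original speaker set (example value)
for idx, f in enumerate(factors):
    out = speed_perturb(wave, f)
    label = augmented_label(3, idx, num_speakers)
    print(f"factor={f} samples={len(out)} label={label}")
```

With this scheme, a training set of N speakers and two non-trivial factors yields 3N classes. Whether the perturbed speech should receive new labels or keep the original ones is exactly the distinction between speaker augmentation and speaker-preserving augmentation drawn above.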